Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-25 Thread Akhil Das
You hit block-not-found issues when your processing time exceeds the batch
duration (this happens with receiver-oriented streaming). If you are
consuming messages from Kafka, then try to use the directStream, or you can
also set the StorageLevel to MEMORY_AND_DISK with the receiver-oriented
consumer (this might slow things down a bit, though).
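
For the Kafka case, a minimal sketch of the direct (receiver-less) approach on
Spark 1.3+, assuming the spark-streaming-kafka artifact is on the classpath
(the broker list and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Direct stream: no receivers and no in-memory blocks to lose;
    // Kafka itself is re-read if a batch has to be recomputed.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder brokers
    val topics = Set("events")                                       // placeholder topic
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}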

Thanks
Best Regards

On Wed, Aug 19, 2015 at 8:21 PM, jlg jgri...@adzerk.com wrote:

 Some background on what we're trying to do:

 We have four Kinesis receivers with varying amounts of data coming through
 them. Ultimately we work on a unioned stream that is getting about 11
 MB/second of data. We use a batch size of 5 seconds.

 We create four distinct DStreams from this data that have different
 aggregation computations (various combinations of
 map/flatMap/reduceByKeyAndWindow and then finishing by serializing the
 records to JSON strings and writing them to S3). We want to do 30 minute
 windows of computations on this data, to get a better compression rate for
 the aggregates (there are a lot of repeated keys across this time frame,
 and
 we want to combine them all -- we do this using reduceByKeyAndWindow).

 But even when trying to do 5-minute windows, we have issues with "Could not
 compute split, block —— not found". This is being run on a YARN cluster and
 it seems like the executors are getting killed even though they should have
 plenty of memory.

 Also, it seems like no computation actually takes place until the end of
 the
 window duration. This seems inefficient if there is a lot of data that you
 know is going to be needed for the computation. Is there any good way
 around
 this?

 These are some of the configuration settings we are using for Spark:

 spark.executor.memory=26000M,\
 spark.executor.cores=4,\
 spark.executor.instances=5,\
 spark.driver.cores=4,\
 spark.driver.memory=24000M,\
 spark.default.parallelism=128,\
 spark.streaming.blockInterval=100ms,\
 spark.streaming.receiver.maxRate=2,\
 spark.akka.timeout=300,\
 spark.storage.memoryFraction=0.6,\
 spark.rdd.compress=true,\
 spark.executor.instances=16,\
 spark.serializer=org.apache.spark.serializer.KryoSerializer,\
 spark.kryoserializer.buffer.max=2047m,\


 Is this the correct way to do this, and how can I further debug to figure
 out this issue?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Some-issues-Could-not-compute-split-block-not-found-and-questions-tp24342.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-19 Thread jlg
Some background on what we're trying to do:

We have four Kinesis receivers with varying amounts of data coming through
them. Ultimately we work on a unioned stream that is getting about 11
MB/second of data. We use a batch size of 5 seconds. 

We create four distinct DStreams from this data that have different
aggregation computations (various combinations of
map/flatMap/reduceByKeyAndWindow and then finishing by serializing the
records to JSON strings and writing them to S3). We want to do 30 minute
windows of computations on this data, to get a better compression rate for
the aggregates (there are a lot of repeated keys across this time frame, and
we want to combine them all -- we do this using reduceByKeyAndWindow). 
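
As a rough sketch of the tumbling 30-minute aggregation described above,
assuming the unioned stream has already been mapped to (key, count) pairs
(all names here are hypothetical):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// keyed: a hypothetical DStream[(String, Long)] built from the unioned stream.
def thirtyMinuteAggregates(keyed: DStream[(String, Long)]): DStream[(String, Long)] =
  keyed.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,  // combine all counts for a key in the window
    Minutes(30),                  // window duration
    Minutes(30))                  // slide duration (tumbling, one output per 30 min)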

But even when trying to do 5-minute windows, we have issues with "Could not
compute split, block —— not found". This is being run on a YARN cluster and
it seems like the executors are getting killed even though they should have
plenty of memory. 

Also, it seems like no computation actually takes place until the end of the
window duration. This seems inefficient if there is a lot of data that you
know is going to be needed for the computation. Is there any good way around
this?
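
On the point about work only happening at the end of the window: the
two-argument reduceByKeyAndWindow still pre-reduces each batch by key as it
arrives, but the final combine across the window runs when the window closes.
The variant that also takes an inverse reduce function maintains the windowed
aggregate incrementally, at the cost of checkpointing. A hedged sketch, with
hypothetical names and a placeholder checkpoint path:

import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

def slidingAggregates(ssc: StreamingContext,
                      keyed: DStream[(String, Long)]): DStream[(String, Long)] = {
  // The inverse-reduce form keeps a running windowed aggregate and
  // requires checkpointing of that state.
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // placeholder path
  keyed.reduceByKeyAndWindow(
    _ + _,        // add counts entering the window
    _ - _,        // subtract counts leaving the window
    Minutes(30),  // window length
    Minutes(5))   // slide: refresh the 30-minute aggregate every 5 minutes
}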

These are some of the configuration settings we are using for Spark:

spark.executor.memory=26000M,\
spark.executor.cores=4,\
spark.executor.instances=5,\
spark.driver.cores=4,\
spark.driver.memory=24000M,\
spark.default.parallelism=128,\
spark.streaming.blockInterval=100ms,\
spark.streaming.receiver.maxRate=2,\
spark.akka.timeout=300,\
spark.storage.memoryFraction=0.6,\
spark.rdd.compress=true,\
spark.executor.instances=16,\
spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark.kryoserializer.buffer.max=2047m,\
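
For what it's worth, the same settings can also be set programmatically when
building the context; a small sketch (the app name is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-aggregates")  // placeholder name
  .set("spark.streaming.blockInterval", "100ms")
  .set("spark.rdd.compress", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "2047m")
val ssc = new StreamingContext(conf, Seconds(5))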


Is this the correct way to do this, and how can I further debug to figure
out this issue? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Some-issues-Could-not-compute-split-block-not-found-and-questions-tp24342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-13 Thread Saiph Kappa
Whether I use 1 or 2 machines, the results are the same... Here follows the
results I got using 1 and 2 receivers with 2 machines:

2 machines, 1 receiver:

sbt/sbt run-main Benchmark 1 machine1  1000 2>&1 | grep -i "Total delay\|record"

15/04/13 16:41:34 INFO JobScheduler: Total delay: 0.156 s for time
1428939694000 ms (execution: 0.142 s)
Received 92910 records
15/04/13 16:41:35 INFO JobScheduler: Total delay: 0.155 s for time
1428939695000 ms (execution: 0.142 s)
Received 92910 records
15/04/13 16:41:36 INFO JobScheduler: Total delay: 0.132 s for time
1428939696000 ms (execution: 0.119 s)
Received 92910 records
15/04/13 16:41:37 INFO JobScheduler: Total delay: 0.172 s for time
1428939697000 ms (execution: 0.161 s)
Received 92910 records
15/04/13 16:41:38 INFO JobScheduler: Total delay: 0.152 s for time
1428939698000 ms (execution: 0.140 s)
Received 92910 records
15/04/13 16:41:39 INFO JobScheduler: Total delay: 0.162 s for time
1428939699000 ms (execution: 0.149 s)
Received 92910 records
15/04/13 16:41:40 INFO JobScheduler: Total delay: 0.156 s for time
1428939700000 ms (execution: 0.143 s)
Received 92910 records
15/04/13 16:41:41 INFO JobScheduler: Total delay: 0.148 s for time
1428939701000 ms (execution: 0.135 s)
Received 92910 records
15/04/13 16:41:42 INFO JobScheduler: Total delay: 0.149 s for time
1428939702000 ms (execution: 0.135 s)
Received 92910 records
15/04/13 16:41:43 INFO JobScheduler: Total delay: 0.153 s for time
1428939703000 ms (execution: 0.136 s)
Received 92910 records
15/04/13 16:41:44 INFO JobScheduler: Total delay: 0.118 s for time
1428939704000 ms (execution: 0.111 s)
Received 92910 records
15/04/13 16:41:45 INFO JobScheduler: Total delay: 0.155 s for time
1428939705000 ms (execution: 0.143 s)
Received 92910 records
15/04/13 16:41:46 INFO JobScheduler: Total delay: 0.138 s for time
1428939706000 ms (execution: 0.126 s)
Received 92910 records
15/04/13 16:41:47 INFO JobScheduler: Total delay: 0.154 s for time
1428939707000 ms (execution: 0.142 s)
Received 92910 records
15/04/13 16:41:48 INFO JobScheduler: Total delay: 0.172 s for time
1428939708000 ms (execution: 0.160 s)
Received 92910 records
15/04/13 16:41:49 INFO JobScheduler: Total delay: 0.144 s for time
1428939709000 ms (execution: 0.133 s)


Receiver Statistics (table from the streaming web UI; columns: Receiver,
Status, Location, Records in last batch [2015/04/13 16:53:54], Minimum rate
[records/sec], Median rate [records/sec], Maximum rate [records/sec],
Last Error)

Receiver-0---10-10-100-
2 machines, 2 receivers:

sbt/sbt run-main Benchmark 2 machine1  1000 2>&1 | grep -i "Total delay\|record"

Received 92910 records
15/04/13 16:43:13 INFO JobScheduler: Total delay: 0.153 s for time
1428939793000 ms (execution: 0.142 s)
Received 92910 records
15/04/13 16:43:14 INFO JobScheduler: Total delay: 0.144 s for time
1428939794000 ms (execution: 0.136 s)
Received 92910 records
15/04/13 16:43:15 INFO JobScheduler: Total delay: 0.145 s for time
1428939795000 ms (execution: 0.132 s)
Received 92910 records
15/04/13 16:43:16 INFO JobScheduler: Total delay: 0.144 s for time
1428939796000 ms (execution: 0.134 s)
Received 92910 records
15/04/13 16:43:17 INFO JobScheduler: Total delay: 0.148 s for time
1428939797000 ms (execution: 0.142 s)
Received 92910 records
15/04/13 16:43:18 INFO JobScheduler: Total delay: 0.136 s for time
1428939798000 ms (execution: 0.123 s)
Received 92910 records
15/04/13 16:43:19 INFO JobScheduler: Total delay: 0.155 s for time
1428939799000 ms (execution: 0.145 s)
Received 92910 records
15/04/13 16:43:20 INFO JobScheduler: Total delay: 0.160 s for time
1428939800000 ms (execution: 0.152 s)
Received 83619 records
15/04/13 16:43:21 INFO JobScheduler: Total delay: 0.141 s for time
1428939801000 ms (execution: 0.131 s)
Received 102201 records
15/04/13 16:43:22 INFO JobScheduler: Total delay: 0.208 s for time
1428939802000 ms (execution: 0.197 s)
Received 83619 records
15/04/13 16:43:23 INFO JobScheduler: Total delay: 0.160 s for time
1428939803000 ms (execution: 0.147 s)
Received 92910 records
15/04/13 16:43:24 INFO JobScheduler: Total delay: 0.197 s for time
1428939804000 ms (execution: 0.185 s)
Received 92910 records
15/04/13 16:43:25 INFO JobScheduler: Total delay: 0.200 s for time
1428939805000 ms (execution: 0.189 s)
Received 92910 records
15/04/13 16:43:26 INFO JobScheduler: Total delay: 0.181 s for time
1428939806000 ms (execution: 0.173 s)
Received 92910 records
15/04/13 16:43:27 INFO JobScheduler: Total delay: 0.189 s for time
1428939807000 ms (execution: 0.178 s)

Receiver Statistics (table from the streaming web UI; columns: Receiver,
Status, Location, Records in last batch [2015/04/13 16:49:36], Minimum rate
[records/sec], Median rate [records/sec], Maximum rate [records/sec],
Last Error)

Receiver-0---10-10-10-9-Receiver-1---

On Thu, Apr 9, 2015 at 7:55 PM, Tathagata Das t...@databricks.com wrote:

 Are you running # of receivers = # machines?

 

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Saiph Kappa
Sorry, I was getting those errors because my workload was not sustainable.

However, I noticed that, by just running the spark-streaming-benchmark (
https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala
), I see no difference in execution time, number of processed records,
or delay whether I'm using 1 machine or 2 machines with the setup
described before (using Spark standalone). Is that normal?



On Fri, Mar 27, 2015 at 5:32 PM, Tathagata Das t...@databricks.com wrote:

 If it is deterministically reproducible, could you generate full DEBUG
 level logs, from the driver and the workers and give it to me? Basically I
 want to trace through what is happening to the block that is not being
 found.
 And can you tell what Cluster manager are you using? Spark Standalone,
 Mesos or YARN?


 On Fri, Mar 27, 2015 at 10:09 AM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 Hi,

 I am just running this simple example with
 machineA: 1 master + 1 worker
 machineB: 1 worker
 «
 val ssc = new StreamingContext(sparkConf, Duration(1000))

 val rawStreams = (1 to numStreams).map(_ =>
   ssc.rawSocketStream[String](host, port,
     StorageLevel.MEMORY_ONLY_SER)).toArray
 val union = ssc.union(rawStreams)

 union.filter(line => Random.nextInt(1) == 0).map(line => {
   var sum = BigInt(0)
   line.toCharArray.foreach(chr => sum += chr.toInt)
   fib2(sum)
   sum
 }).reduceByWindow(_ + _, Seconds(1), Seconds(1)).map(s => s"### result: $s").print()
 »

 And I'm getting the following exceptions:

 Log from machineB
 «
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 132
 15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 134
 15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
 15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast
 variable 24
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 136
 15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 138
 15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 140
 15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
 curMem=47117, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
 bytes in memory (estimated size 1886.0 B, free 267.2 MB)
 15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
 broadcast_24_piece0
 15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24
 took 19 ms
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
 curMem=49003, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values
 in memory (estimated size 3.0 KB, free 267.2 MB)
 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0
 (TID 140)
 java.lang.Exception: Could not compute split, block input-0-1427473262420
 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:701)
 15/03/27 16:21:35 ERROR Executor: Exception in task 6.0 in stage 27.0
 (TID 138)
 java.lang.Exception: Could not compute split, block input-0-1427473262418
 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at 

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
Are you running # of receivers = # machines?

TD

On Thu, Apr 9, 2015 at 9:56 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

 Sorry, I was getting those errors because my workload was not sustainable.

 However, I noticed that, by just running the spark-streaming-benchmark (
 https://github.com/tdas/spark-streaming-benchmark/blob/master/Benchmark.scala
 ), I see no difference in execution time, number of processed records,
 or delay whether I'm using 1 machine or 2 machines with the setup
 described before (using Spark standalone). Is that normal?



 On Fri, Mar 27, 2015 at 5:32 PM, Tathagata Das t...@databricks.com
 wrote:

 If it is deterministically reproducible, could you generate full DEBUG
 level logs, from the driver and the workers and give it to me? Basically I
 want to trace through what is happening to the block that is not being
 found.
 And can you tell what Cluster manager are you using? Spark Standalone,
 Mesos or YARN?


 On Fri, Mar 27, 2015 at 10:09 AM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 Hi,

 I am just running this simple example with
 machineA: 1 master + 1 worker
 machineB: 1 worker
 «
 val ssc = new StreamingContext(sparkConf, Duration(1000))

 val rawStreams = (1 to numStreams).map(_ =>
   ssc.rawSocketStream[String](host, port,
     StorageLevel.MEMORY_ONLY_SER)).toArray
 val union = ssc.union(rawStreams)

 union.filter(line => Random.nextInt(1) == 0).map(line => {
   var sum = BigInt(0)
   line.toCharArray.foreach(chr => sum += chr.toInt)
   fib2(sum)
   sum
 }).reduceByWindow(_ + _, Seconds(1), Seconds(1)).map(s => s"### result: $s").print()
 »

 And I'm getting the following exceptions:

 Log from machineB
 «
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
 132
 15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
 134
 15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
 15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast
 variable 24
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
 136
 15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
 138
 15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task
 140
 15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
 curMem=47117, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
 bytes in memory (estimated size 1886.0 B, free 267.2 MB)
 15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
 broadcast_24_piece0
 15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24
 took 19 ms
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
 curMem=49003, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values
 in memory (estimated size 3.0 KB, free 267.2 MB)
 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0
 (TID 140)
 java.lang.Exception: Could not compute split, block
 input-0-1427473262420 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:701)
 15/03/27 16:21:35 ERROR Executor: Exception in task 6.0 in stage 27.0
 (TID 138)
 java.lang.Exception: Could not compute split, block
 input-0-1427473262418 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at 

Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Saiph Kappa
Hi,

I am just running this simple example with
machineA: 1 master + 1 worker
machineB: 1 worker
«
val ssc = new StreamingContext(sparkConf, Duration(1000))

val rawStreams = (1 to numStreams).map(_ =>
  ssc.rawSocketStream[String](host, port,
    StorageLevel.MEMORY_ONLY_SER)).toArray
val union = ssc.union(rawStreams)

union.filter(line => Random.nextInt(1) == 0).map(line => {
  var sum = BigInt(0)
  line.toCharArray.foreach(chr => sum += chr.toInt)
  fib2(sum)
  sum
}).reduceByWindow(_ + _, Seconds(1), Seconds(1)).map(s => s"### result: $s").print()
»

And I'm getting the following exceptions:

Log from machineB
«
15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 132
15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 134
15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast variable
24
15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 136
15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 138
15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 140
15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
curMem=47117, maxMem=280248975
15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
bytes in memory (estimated size 1886.0 B, free 267.2 MB)
15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
broadcast_24_piece0
15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24 took
19 ms
15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
curMem=49003, maxMem=280248975
15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values in
memory (estimated size 3.0 KB, free 267.2 MB)
15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID
140)
java.lang.Exception: Could not compute split, block input-0-1427473262420
not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
15/03/27 16:21:35 ERROR Executor: Exception in task 6.0 in stage 27.0 (TID
138)
java.lang.Exception: Could not compute split, block input-0-1427473262418
not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Tathagata Das
If it is deterministically reproducible, could you generate full DEBUG
level logs, from the driver and the workers and give it to me? Basically I
want to trace through what is happening to the block that is not being
found.
And can you tell what Cluster manager are you using? Spark Standalone,
Mesos or YARN?


On Fri, Mar 27, 2015 at 10:09 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

 Hi,

 I am just running this simple example with
 machineA: 1 master + 1 worker
 machineB: 1 worker
 «
 val ssc = new StreamingContext(sparkConf, Duration(1000))

 val rawStreams = (1 to numStreams).map(_ =>
   ssc.rawSocketStream[String](host, port,
     StorageLevel.MEMORY_ONLY_SER)).toArray
 val union = ssc.union(rawStreams)

 union.filter(line => Random.nextInt(1) == 0).map(line => {
   var sum = BigInt(0)
   line.toCharArray.foreach(chr => sum += chr.toInt)
   fib2(sum)
   sum
 }).reduceByWindow(_ + _, Seconds(1), Seconds(1)).map(s => s"### result: $s").print()
 »

 And I'm getting the following exceptions:

 Log from machineB
 «
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 132
 15/03/27 16:21:35 INFO Executor: Running task 0.0 in stage 27.0 (TID 132)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 134
 15/03/27 16:21:35 INFO Executor: Running task 2.0 in stage 27.0 (TID 134)
 15/03/27 16:21:35 INFO TorrentBroadcast: Started reading broadcast
 variable 24
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 136
 15/03/27 16:21:35 INFO Executor: Running task 4.0 in stage 27.0 (TID 136)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 138
 15/03/27 16:21:35 INFO Executor: Running task 6.0 in stage 27.0 (TID 138)
 15/03/27 16:21:35 INFO CoarseGrainedExecutorBackend: Got assigned task 140
 15/03/27 16:21:35 INFO Executor: Running task 8.0 in stage 27.0 (TID 140)
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(1886) called with
 curMem=47117, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24_piece0 stored as
 bytes in memory (estimated size 1886.0 B, free 267.2 MB)
 15/03/27 16:21:35 INFO BlockManagerMaster: Updated info of block
 broadcast_24_piece0
 15/03/27 16:21:35 INFO TorrentBroadcast: Reading broadcast variable 24
 took 19 ms
 15/03/27 16:21:35 INFO MemoryStore: ensureFreeSpace(3104) called with
 curMem=49003, maxMem=280248975
 15/03/27 16:21:35 INFO MemoryStore: Block broadcast_24 stored as values in
 memory (estimated size 3.0 KB, free 267.2 MB)
 15/03/27 16:21:35 ERROR Executor: Exception in task 8.0 in stage 27.0 (TID
 140)
 java.lang.Exception: Could not compute split, block input-0-1427473262420
 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:701)
 15/03/27 16:21:35 ERROR Executor: Exception in task 6.0 in stage 27.0 (TID
 138)
 java.lang.Exception: Could not compute split, block input-0-1427473262418
 not found
 at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at 

Re: Spark Streaming : Could not compute split, block not found

2014-10-09 Thread Tian Zhang
I have figured out why I am getting this error:
We have a lot of data in Kafka and the DStream from Kafka used
MEMORY_ONLY_SER, so once memory ran low, Spark started to discard data that
was needed later. Once I changed to MEMORY_AND_DISK_SER, the error was gone.
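
A minimal sketch of that change for the receiver-based Kafka stream
(kafkaParams and topicMap are placeholder names here):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Blocks spill to disk instead of being dropped when executor memory runs low.
def kafkaLines(ssc: StreamingContext,
               kafkaParams: Map[String, String],
               topicMap: Map[String, Int]) =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)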

Tian





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p16084.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Streaming : Could not compute split, block not found

2014-08-04 Thread Tathagata Das
Aaah sorry, I should have been more clear. Can you give me INFO (DEBUG
even better) level logs since the start of the program? I need to see
how the cleanup code is managing to delete the block.

TD

On Fri, Aug 1, 2014 at 10:26 PM, Kanwaldeep kanwal...@gmail.com wrote:
 Here is the log file.
 streaming.gz
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n11240/streaming.gz

 There are quite a few AskTimeouts happening for about 2 minutes,
 followed by block-not-found errors.

 Thanks
 Kanwal




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11240.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
We are using Spark 1.0 for Spark Streaming on a Spark Standalone cluster and
are seeing the following error.

Job aborted due to stage failure: Task 3475.0:15 failed 4 times, most
recent failure: Exception failure in TID 216394 on host
hslave33102.sjc9.service-now.com: java.lang.Exception: Could not compute
split, block input-0-140686934 not found
org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51) 

We are using the MEMORY_AND_DISK_SER storage level for the input streams, and
the stream is also being persisted since we have multiple transformations
happening on the input stream.


val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder,
DefaultDecoder](ssc, kafkaParams, topicpMap,
StorageLevel.MEMORY_AND_DISK_SER)

lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

We are aggregating data every 15 minutes as well as every hour.
spark.streaming.blockInterval=1 so that we minimize the number of blocks read.

The problem started at the 15 minute interval but now I'm seeing it happen
every hour since last night.

Any suggestions?

Thanks
Kanwal



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
Are you accessing the RDDs of raw data blocks and running independent
Spark jobs on them (that is, outside the DStream)? In that case this may
happen, as Spark Streaming will clean up the raw data based on the
DStream operations (if there is a window op of 15 mins, it will keep
the data around for at least 15 mins). So independent Spark jobs that
access old data may fail. The solution for that is using
DStream.remember() on the raw input stream to make sure the data is
kept around.
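
A minimal sketch of the remember() suggestion, assuming a raw input DStream
named lines (a hypothetical name):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Ask Spark Streaming to keep the RDDs (and their blocks) generated by this
// stream around for at least 30 minutes, so jobs started outside the DStream
// computation can still read them before cleanup.
def keepRawData(lines: DStream[String]): Unit =
  lines.remember(Minutes(30))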

Not sure if this was the problem or not. For more info, can you tell
whether you are running Spark 0.9 or 1.0?



TD

On Fri, Aug 1, 2014 at 10:55 AM, Kanwaldeep kanwal...@gmail.com wrote:
 We are using Spark 1.0 for Spark Streaming on a Spark Standalone cluster and
 are seeing the following error.

 Job aborted due to stage failure: Task 3475.0:15 failed 4 times, most
 recent failure: Exception failure in TID 216394 on host
 hslave33102.sjc9.service-now.com: java.lang.Exception: Could not compute
 split, block input-0-140686934 not found
 org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)

 We are using the MEMORY_AND_DISK_SER storage level for the input streams, and
 the stream is also being persisted since we have multiple transformations
 happening on the input stream.


 val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder,
 DefaultDecoder](ssc, kafkaParams, topicpMap,
 StorageLevel.MEMORY_AND_DISK_SER)

 lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

 We are aggregating data every 15 minutes as well as every hour.
 spark.streaming.blockInterval=1 so that we minimize the number of blocks read.

 The problem started at the 15 minute interval but now I'm seeing it happen
 every hour since last night.

 Any suggestions?

 Thanks
 Kanwal



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
We are using Spark 1.0.

I'm using DStream operations such as map, filter, and reduceByKeyAndWindow,
and doing a foreach operation on the DStream.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11209.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
All the operations being done use the DStream. I do read an RDD into
memory, which is collected and converted into a map used for lookups as
part of the DStream operations. This RDD is loaded only once and converted
into a map that is then used on the streamed data.

Do you mean non-streaming jobs on RDDs using raw Kafka data?

Log File attached:
streaming.gz
http://apache-spark-user-list.1001560.n3.nabble.com/file/n11229/streaming.gz  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11229.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
I meant: are you using RDDs generated by DStreams in Spark jobs outside
the DStream computation?
Something like this:



var globalRDD: RDD[_] = null

dstream.foreachRDD { rdd =>
  // keep a global pointer to one of the RDDs generated by the dstream
  if (runningFirstTime) globalRDD = rdd
}
ssc.start()
...

// much, much later, try to use that RDD in Spark jobs independent
// of the streaming computation
globalRDD.count()










On Fri, Aug 1, 2014 at 3:52 PM, Kanwaldeep kanwal...@gmail.com wrote:
 All the operations being done are using the dstream. I do read an RDD in
 memory which is collected and converted into a map and used for lookups as
 part of DStream operations. This RDD is loaded only once and converted into
 map that is then used on streamed data.

 Do you mean non streaming jobs on RDD using raw kafka data?

 Log File attached:
 streaming.gz
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n11229/streaming.gz



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11229.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
Not at all. Don't have any such code.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Tathagata Das
Then could you try giving me a log?
And as a workaround, disable automatic unpersisting by setting
spark.streaming.unpersist = false.
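
A sketch of that workaround (the app name is a placeholder):

import org.apache.spark.SparkConf

// Turn off automatic unpersisting of the RDDs generated by the streaming
// input, so raw blocks are not cleaned up aggressively.
val conf = new SparkConf()
  .setAppName("streaming-job")  // placeholder
  .set("spark.streaming.unpersist", "false")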

On Fri, Aug 1, 2014 at 4:10 PM, Kanwaldeep kanwal...@gmail.com wrote:
 Not at all. Don't have any such code.



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11231.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Streaming : Could not compute split, block not found

2014-08-01 Thread Kanwaldeep
Here is the log file.
streaming.gz
http://apache-spark-user-list.1001560.n3.nabble.com/file/n11240/streaming.gz  

There are quite a few AskTimeouts happening for about 2 minutes,
followed by block-not-found errors.

Thanks
Kanwal




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Could-not-compute-split-block-not-found-tp11186p11240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Could not compute split, block not found

2014-07-01 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input
streams?

TD


On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 let's say the processing time is t' and the window size t. Spark does not
 *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
 think the way Spark handles it is very nice, in particular since 1) it does
 not mix up the order in which items arrived in the stream, so items from a
 later window will always be processed later, and 2) because an increase in
 data will not be punished with high load and unresponsive systems, but with
 disk space consumption instead.

 However, if all of your windows require t' > t processing time (and it's
 not because you are waiting, but because you actually do some computation),
 then you are in bad luck, because if you start processing the next window
 while the previous one is still processed, you have less resources for each
 and processing will take even longer. However, if you are only waiting
 (e.g., for network I/O), then maybe you can employ some asynchronous
 solution where your tasks return immediately and deliver their result via a
 callback later?

 Tobias



 On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does
 Spark always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have any suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 








Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tobias,

Your explanation makes a lot of sense. Actually, I tried to use partial
data on the same program yesterday. It has been up for around 24 hours and
is still running correctly. Thanks!

Bill


On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 let's say the processing time is t' and the window size t. Spark does not
 *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
 think the way Spark handles it is very nice, in particular since 1) it does
 not mix up the order in which items arrived in the stream, so items from a
 later window will always be processed later, and 2) because an increase in
 data will not be punished with high load and unresponsive systems, but with
 disk space consumption instead.

 However, if all of your windows require t' > t processing time (and it's
 not because you are waiting, but because you actually do some computation),
 then you are in bad luck, because if you start processing the next window
 while the previous one is still processed, you have less resources for each
 and processing will take even longer. However, if you are only waiting
 (e.g., for network I/O), then maybe you can employ some asynchronous
 solution where your tasks return immediately and deliver their result via a
 callback later?

 Tobias



 On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does
 Spark always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have any suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 








Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tathagata,

Yes. The input stream is from Kafka, and my program reads the data, keeps
all the data in memory, processes the data, and generates the output.

Bill


On Mon, Jun 30, 2014 at 11:45 PM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 Are you by any chance using only memory in the storage level of the input
 streams?

 TD


 On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 let's say the processing time is t' and the window size t. Spark does not
 *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
 think the way Spark handles it is very nice, in particular since 1) it does
 not mix up the order in which items arrived in the stream, so items from a
 later window will always be processed later, and 2) because an increase in
 data will not be punished with high load and unresponsive systems, but with
 disk space consumption instead.

 However, if all of your windows require t' > t processing time (and it's
 not because you are waiting, but because you actually do some computation),
 then you are in bad luck, because if you start processing the next window
 while the previous one is still processed, you have less resources for each
 and processing will take even longer. However, if you are only waiting
 (e.g., for network I/O), then maybe you can employ some asynchronous
 solution where your tasks return immediately and deliver their result via a
 callback later?

 Tobias



 On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does
 Spark always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have any suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 









Re: Could not compute split, block not found

2014-06-30 Thread Bill Jay
Tobias,

Your suggestion is very helpful. I will definitely investigate it.

Just curious. Suppose the batch size is t seconds. In practice, does Spark
always require the program to finish processing the data of t seconds
within t seconds' processing time? Can Spark begin to consume the new batch
before finishing processing the previous batch? If Spark can do them together,
it may save the processing time and solve the problem of data piling up.

Thanks!

Bill




On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data so
 that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile up.
 Do you have any suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size. It
 ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 






Re: Could not compute split, block not found

2014-06-30 Thread Tobias Pfeiffer
Bill,

let's say the processing time is t' and the window size t. Spark does not
*require* t' < t. In fact, for *temporary* peaks in your streaming data, I
think the way Spark handles it is very nice, in particular since 1) it does
not mix up the order in which items arrived in the stream, so items from a
later window will always be processed later, and 2) because an increase in
data will not be punished with high load and unresponsive systems, but with
disk space consumption instead.

However, if all of your windows require t' > t processing time (and it's
not because you are waiting, but because you actually do some computation),
then you are in bad luck, because if you start processing the next window
while the previous one is still processed, you have less resources for each
and processing will take even longer. However, if you are only waiting
(e.g., for network I/O), then maybe you can employ some asynchronous
solution where your tasks return immediately and deliver their result via a
callback later?

Tobias
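
One way to see whether t' is creeping past t in practice is to watch the
per-batch scheduling delay (also shown on the streaming tab of the web UI);
a hedged sketch using the StreamingListener API:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// A growing scheduling delay means batches are queuing up faster than they
// are processed, i.e. t' is exceeding t.
def logBatchDelays(ssc: StreamingContext): Unit =
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      val info = batch.batchInfo
      println(s"batch ${info.batchTime}: scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
        s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
    }
  })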



On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does Spark
 always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start until the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data so
 that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile up.
 Do you have any suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 







Re: Could not compute split, block not found

2014-06-29 Thread Bill Jay
Tobias,

Thanks for your help. I think in my case, the batch size is 1 minute.
However, it takes my program more than 1 minute to process 1 minute's data.
I am not sure whether it is because the unprocessed data piles up. Do you
have any suggestion on how to check it and solve it? Thanks!

Bill


On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size. It
 ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 



Could not compute split, block not found

2014-06-27 Thread Bill Jay
Hi,

I am running a spark streaming job with 1 minute as the batch size. It ran
around 84 minutes and was killed because of the exception with the
following information:

*java.lang.Exception: Could not compute split, block input-0-1403893740400
not found*


Before it was killed, it was able to correctly generate output for each
batch.

Any help on this will be greatly appreciated.

Bill