RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Another quick question... I've got 4 nodes with 2 cores each. I've assigned the 
streaming app 4 cores. It seems to be using one per node. I imagine forwarding 
from the receivers to the executors is causing unnecessary processing. Is 
there a way to specify that I want 2 cores from the same machine to be 
involved (even better if this can be specified during spark-submit)?

Thanks,
Ashic.

From: as...@live.com
To: gerard.m...@gmail.com; asudipta.baner...@gmail.com
CC: user@spark.apache.org; tathagata.das1...@gmail.com
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 15:40:17 +




Yup...looks like it. I can do some tricks to reduce setup costs further, but 
this is much better than where I was yesterday. Thanks for your awesome input :)

-Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 16:34:38 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: asudipta.baner...@gmail.com
CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com

Given that the process, and in particular the setup of connections, is bound 
to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would 
be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can read 
the tuning details here: http://www.virdata.com/tuning-spark/#Partitions)
-kr, Gerard.
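
A minimal sketch of the pattern being discussed, for context: with receiver-based 
streams the number of partitions per batch per receiver is roughly the batch 
interval divided by the block interval, so raising spark.streaming.blockInterval 
means fewer partitions and therefore fewer per-partition connection setups inside 
foreachPartition. The writeOut/openSession/handle names are placeholders, not from 
this thread, and the sketch assumes Spark 1.x, where the block interval is given 
in milliseconds.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    // 15 s batches with 1000 ms blocks -> ~15 partitions per receiver per batch,
    // instead of ~75 with the default 200 ms block interval.
    val conf = new SparkConf()
      .setAppName("App1")
      .set("spark.streaming.blockInterval", "1000")
    val ssc = new StreamingContext(conf, Seconds(15))

    // The cost in question: everything before the inner foreach runs once per
    // partition of every batch.
    def writeOut[T](stream: DStream[T])(openSession: () => AutoCloseable)
                   (handle: (AutoCloseable, T) => Unit): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val session = openSession()                     // per-partition setup
          try partition.foreach(x => handle(session, x))
          finally session.close()                         // per-partition teardown
        }
      }
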
On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas  wrote:
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 
secs. I think there's evidence that setup costs are quite high in this case 
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee  
wrote:
Hi Ashic Mahtab,

Are Cassandra and Zookeeper installed as part of the YARN architecture, or are 
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/22 14:40:29]; rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch, min rate 29, median rate 106, max rate 295, last error: -
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch, min rate 29, median rate 107, max rate 291, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-memory filters to check if more 
processing is required. If so, only then does the "heavy" work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.
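
A sketch of one way around that per-partition setup cost, for reference: keep the 
session and prepared statements in a lazily initialised, JVM-wide object so each 
executor pays the cost once rather than once per partition. This assumes the plain 
DataStax Java driver; the Spark Cassandra connector's withSessionDo may already 
cache sessions internally, in which case only the prepared statements and other 
per-partition objects need this treatment. The host, keyspace and CQL below are 
illustrative.

    import java.util.concurrent.ConcurrentHashMap
    import com.datastax.driver.core.{Cluster, PreparedStatement, Session}

    // One session and one statement cache per executor JVM, created on first use.
    object CassandraResources {
      lazy val session: Session =
        Cluster.builder().addContactPoint("cassandra-host").build().connect("my_keyspace")

      private val statements = new ConcurrentHashMap[String, PreparedStatement]()

      def prepared(cql: String): PreparedStatement = {
        val cached = statements.get(cql)
        if (cached != null) cached
        else {
          val ps = session.prepare(cql)
          val raced = statements.putIfAbsent(cql, ps)
          if (raced != null) raced else ps
        }
      }
    }

    // Inside foreachPartition the setup is then just a lookup, e.g.:
    // rdd.foreachPartition { partition =>
    //   val stmt = CassandraResources.prepared("SELECT * FROM events WHERE id = ?")
    //   partition.foreach(e => CassandraResources.session.execute(stmt.bind(e)))
    // }
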

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
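
(A quick back-of-the-envelope check of this with the numbers from the original 
post: batches are generated every 2 seconds but take about 4 seconds each to 
process on average (16482 batches in roughly 18.4 hours), so the backlog grows 
by about 2 seconds per batch:

    16482 batches x (~4 s processing - 2 s interval) ~= 33,000 s ~= 9.2 hours

which lines up with the reported scheduling delay of about 9 hours 15 minutes.)
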
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreci

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Yup...looks like it. I can do some tricks to reduce setup costs further, but 
this is much better than where I was yesterday. Thanks for your awesome input :)

-Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 16:34:38 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: asudipta.baner...@gmail.com
CC: as...@live.com; user@spark.apache.org; tathagata.das1...@gmail.com

Given that the process, and in particular the setup of connections, is bound 
to the number of partitions (in x.foreachPartition{ x => ??? }), I think it would 
be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can read 
the tuning details here: http://www.virdata.com/tuning-spark/#Partitions)
-kr, Gerard.
On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas  wrote:
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761 
secs. I think there's evidence that setup costs are quite high in this case 
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee  
wrote:
Hi Ashic Mahtab,

Are Cassandra and Zookeeper installed as part of the YARN architecture, or are 
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/22 14:40:29]; rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch, min rate 29, median rate 106, max rate 295, last error: -
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch, min rate 29, median rate 107, max rate 291, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-memory filters to check if more 
processing is required. If so, only then does the "heavy" work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

StreamingSta

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
Given that the process, and in particular the setup of connections, is
bound to the number of partitions (in x.foreachPartition{ x => ??? }), I
think it would be worth trying to reduce them.
Increasing 'spark.streaming.blockInterval' will do the trick (you can
read the tuning details here:
http://www.virdata.com/tuning-spark/#Partitions)

-kr, Gerard.

On Thu, Jan 22, 2015 at 4:28 PM, Gerard Maas  wrote:

> So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in
> 4.761 secs.
> I think there's evidence that setup costs are quite high in this case and
> increasing the batch interval is helping.
>
> On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee <
> asudipta.baner...@gmail.com> wrote:
>
>> Hi Ashic Mahtab,
>>
>> Are Cassandra and Zookeeper installed as part of the YARN architecture, or
>> are they installed in a separate layer alongside Apache Spark?
>>
>> Thanks and Regards,
>> Sudipta
>>
>> On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:
>>
>>> Hi Guys,
>>> So I changed the interval to 15 seconds. There's obviously a lot more
>>> messages per batch, but (I think) it looks a lot healthier. Can you see any
>>> major warning signs? I think that with 2 second intervals, the setup /
>>> teardown per partition was what was causing the delays.
>>>
>>> Streaming
>>>
>>>- *Started at: *Thu Jan 22 13:23:12 GMT 2015
>>>- *Time since start: *1 hour 17 minutes 16 seconds
>>>- *Network receivers: *2
>>>- *Batch interval: *15 seconds
>>>- *Processed batches: *309
>>>- *Waiting batches: *0
>>>
>>>
>>>
>>> Statistics over last 100 processed batchesReceiver Statistics
>>>
>>>- Receiver
>>>
>>>
>>>- Status
>>>
>>>
>>>- Location
>>>
>>>
>>>- Records in last batch
>>>- [2015/01/22 14:40:29]
>>>
>>>
>>>- Minimum rate
>>>- [records/sec]
>>>
>>>
>>>- Median rate
>>>- [records/sec]
>>>
>>>
>>>- Maximum rate
>>>- [records/sec]
>>>
>>>
>>>- Last Error
>>>
>>> RmqReceiver-0ACTIVEVDCAPP53.foo.local2.6 K29106295-RmqReceiver-1ACTIVE
>>> VDCAPP50.bar.local2.6 K29107291-
>>> Batch Processing Statistics
>>>
>>>MetricLast batchMinimum25th percentileMedian75th 
>>> percentileMaximumProcessing
>>>Time4 seconds 812 ms4 seconds 698 ms4 seconds 738 ms4 seconds 761 ms4
>>>seconds 788 ms5 seconds 802 msScheduling Delay2 ms0 ms3 ms3 ms4 ms9
>>>msTotal Delay4 seconds 814 ms4 seconds 701 ms4 seconds 739 ms4
>>>seconds 764 ms4 seconds 792 ms5 seconds 809 ms
>>>
>>>
>>> Regards,
>>> Ashic.
>>> --
>>> From: as...@live.com
>>> To: gerard.m...@gmail.com
>>> CC: user@spark.apache.org
>>> Subject: RE: Are these numbers abnormal for spark streaming?
>>> Date: Thu, 22 Jan 2015 12:32:05 +
>>>
>>>
>>> Hi Gerard,
>>> Thanks for the response.
>>>
>>> The messages get deserialised from msgpack format, and one of the strings
>>> is deserialised to JSON. Certain fields are checked to decide if further
>>> processing is required. If so, it goes through a series of in-memory filters
>>> to check if more processing is required. If so, only then does the "heavy"
>>> work start. That consists of a few db queries, and potential updates to the
>>> db + message on message queue. The majority of messages don't need
>>> processing. The messages needing processing at peak are about three every
>>> other second.
>>>
>>> One possible thing that might be happening is the session
>>> initialisation and prepared statement initialisation for each partition. I
>>> can resort to some tricks, but I think I'll try increasing batch interval
>>> to 15 seconds. I'll report back with findings.
>>>
>>> Thanks,
>>> Ashic.
>>>
>>> --
>>> From: gerard.m...@gmail.com
>>> Date: Thu, 22 Jan 2015 12:30:08 +0100
>>> Subject: Re: Are these numbers abnormal for spark streaming?
>>> To: tathagata.das1...@gmail.com
>>> CC: as...@live.com; t...@databricks.com; user@spark.apache.org
>>>
>>> and post the code (if possible).
>>> In 

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
So the system has gone from 7 msgs in 4.961 secs (median) to 106 msgs in 4.761
secs.
I think there's evidence that setup costs are quite high in this case and
increasing the batch interval is helping.

On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee <
asudipta.baner...@gmail.com> wrote:

> Hi Ashic Mahtab,
>
> Are Cassandra and Zookeeper installed as part of the YARN architecture, or
> are they installed in a separate layer alongside Apache Spark?
>
> Thanks and Regards,
> Sudipta
>
> On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:
>
>> Hi Guys,
>> So I changed the interval to 15 seconds. There's obviously a lot more
>> messages per batch, but (I think) it looks a lot healthier. Can you see any
>> major warning signs? I think that with 2 second intervals, the setup /
>> teardown per partition was what was causing the delays.
>>
>> Streaming
>>
>>- *Started at: *Thu Jan 22 13:23:12 GMT 2015
>>- *Time since start: *1 hour 17 minutes 16 seconds
>>- *Network receivers: *2
>>- *Batch interval: *15 seconds
>>- *Processed batches: *309
>>- *Waiting batches: *0
>>
>>
>>
>> Statistics over last 100 processed batchesReceiver Statistics
>>
>>- Receiver
>>
>>
>>- Status
>>
>>
>>- Location
>>
>>
>>- Records in last batch
>>- [2015/01/22 14:40:29]
>>
>>
>>- Minimum rate
>>- [records/sec]
>>
>>
>>- Median rate
>>- [records/sec]
>>
>>
>>- Maximum rate
>>- [records/sec]
>>
>>
>>- Last Error
>>
>> RmqReceiver-0ACTIVEVDCAPP53.foo.local2.6 K29106295-RmqReceiver-1ACTIVE
>> VDCAPP50.bar.local2.6 K29107291-
>> Batch Processing Statistics
>>
>>MetricLast batchMinimum25th percentileMedian75th 
>> percentileMaximumProcessing
>>Time4 seconds 812 ms4 seconds 698 ms4 seconds 738 ms4 seconds 761 ms4
>>seconds 788 ms5 seconds 802 msScheduling Delay2 ms0 ms3 ms3 ms4 ms9 
>> msTotal
>>Delay4 seconds 814 ms4 seconds 701 ms4 seconds 739 ms4 seconds 764 ms4
>>seconds 792 ms5 seconds 809 ms
>>
>>
>> Regards,
>> Ashic.
>> --
>> From: as...@live.com
>> To: gerard.m...@gmail.com
>> CC: user@spark.apache.org
>> Subject: RE: Are these numbers abnormal for spark streaming?
>> Date: Thu, 22 Jan 2015 12:32:05 +
>>
>>
>> Hi Gerard,
>> Thanks for the response.
>>
>> The messages get deserialised from msgpack format, and one of the strings
>> is deserialised to JSON. Certain fields are checked to decide if further
>> processing is required. If so, it goes through a series of in-memory filters
>> to check if more processing is required. If so, only then does the "heavy"
>> work start. That consists of a few db queries, and potential updates to the
>> db + message on message queue. The majority of messages don't need
>> processing. The messages needing processing at peak are about three every
>> other second.
>>
>> One possible thing that might be happening is the session initialisation
>> and prepared statement initialisation for each partition. I can resort to
>> some tricks, but I think I'll try increasing batch interval to 15 seconds.
>> I'll report back with findings.
>>
>> Thanks,
>> Ashic.
>>
>> --
>> From: gerard.m...@gmail.com
>> Date: Thu, 22 Jan 2015 12:30:08 +0100
>> Subject: Re: Are these numbers abnormal for spark streaming?
>> To: tathagata.das1...@gmail.com
>> CC: as...@live.com; t...@databricks.com; user@spark.apache.org
>>
>> and post the code (if possible).
>> In a nutshell, your processing time > batch interval,  resulting in an
>> ever-increasing delay that will end up in a crash.
>> 3 secs to process 14 messages looks like a lot. Curious what the job
>> logic is.
>>
>> -kr, Gerard.
>>
>> On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> This is not normal. It's a huge scheduling delay!! Can you tell me more
>> about the application?
>> - cluster setup, number of receivers, what's the computation, etc.
>>
>> On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:
>>
>> Hate to do this...but...erm...bump? Would really appreciate input from
>> others using Streaming. Or at least some docs that would tell me if these
>> are expected or not.
>>
>> 

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Sudipta,
Standalone Spark master. Separate ZooKeeper cluster. 4 worker nodes with 
Cassandra + Spark on each. No Hadoop / HDFS / YARN.

Regards,
Ashic.

Date: Thu, 22 Jan 2015 20:42:43 +0530
Subject: Re: Are these numbers abnormal for spark streaming?
From: asudipta.baner...@gmail.com
To: as...@live.com
CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com

Hi Ashic Mahtab,

Are Cassandra and Zookeeper installed as part of the YARN architecture, or are 
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:



Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/22 14:40:29]; rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch, min rate 29, median rate 106, max rate 295, last error: -
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch, min rate 29, median rate 107, max rate 291, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-memory filters to check if more 
processing is required. If so, only then does the "heavy" work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

StreamingStarted at: Tue Jan 20 16:58:43 GMT 2015Time since start: 18 hours 24 
minutes 34 secondsNetwork receivers: 2Batch interval: 2 secondsProcessed 
batches: 16482Waiting batches: 1

Statistics over last 100 processed batchesReceiver 
StatisticsReceiverStatusLocationRecords in last batch[2015/01/21 
11:23:18]Minimum rate[records/sec]Median rate[records/sec]Maximum 
rate[records/sec]Last ErrorRmqReceiver-0ACTIVEF
144727-RmqReceiver-1ACTIVEBR
124726-Batch Processing StatisticsMetricLast batchMinimum25th 
percentileMedian75th percentileMaximumProcessing Time3 seconds 994 ms157 ms4 
seconds 16 ms4 seconds 961 ms5 seconds 3 ms5 seconds 171 msScheduling Delay9 
hours 15 minutes 4 seconds9 hours 10 minutes 54

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Sudipta Banerjee
Hi Ashic Mahtab,

Are Cassandra and Zookeeper installed as part of the YARN architecture, or are
they installed in a separate layer alongside Apache Spark?

Thanks and Regards,
Sudipta

On Thu, Jan 22, 2015 at 8:13 PM, Ashic Mahtab  wrote:

> Hi Guys,
> So I changed the interval to 15 seconds. There's obviously a lot more
> messages per batch, but (I think) it looks a lot healthier. Can you see any
> major warning signs? I think that with 2 second intervals, the setup /
> teardown per partition was what was causing the delays.
>
> Streaming
>
>- *Started at: *Thu Jan 22 13:23:12 GMT 2015
>- *Time since start: *1 hour 17 minutes 16 seconds
>- *Network receivers: *2
>- *Batch interval: *15 seconds
>- *Processed batches: *309
>- *Waiting batches: *0
>
>
>
> Statistics over last 100 processed batchesReceiver Statistics
>
>- Receiver
>
>
>- Status
>
>
>- Location
>
>
>- Records in last batch
>- [2015/01/22 14:40:29]
>
>
>- Minimum rate
>- [records/sec]
>
>
>- Median rate
>- [records/sec]
>
>
>- Maximum rate
>- [records/sec]
>
>
>- Last Error
>
> RmqReceiver-0ACTIVEVDCAPP53.foo.local2.6 K29106295-RmqReceiver-1ACTIVE
> VDCAPP50.bar.local2.6 K29107291-
> Batch Processing Statistics
>
>MetricLast batchMinimum25th percentileMedian75th 
> percentileMaximumProcessing
>Time4 seconds 812 ms4 seconds 698 ms4 seconds 738 ms4 seconds 761 ms4
>seconds 788 ms5 seconds 802 msScheduling Delay2 ms0 ms3 ms3 ms4 ms9 msTotal
>Delay4 seconds 814 ms4 seconds 701 ms4 seconds 739 ms4 seconds 764 ms4
>seconds 792 ms5 seconds 809 ms
>
>
> Regards,
> Ashic.
> --
> From: as...@live.com
> To: gerard.m...@gmail.com
> CC: user@spark.apache.org
> Subject: RE: Are these numbers abnormal for spark streaming?
> Date: Thu, 22 Jan 2015 12:32:05 +
>
>
> Hi Gerard,
> Thanks for the response.
>
> The messages get deserialised from msgpack format, and one of the strings
> is deserialised to JSON. Certain fields are checked to decide if further
> processing is required. If so, it goes through a series of in-memory filters
> to check if more processing is required. If so, only then does the "heavy"
> work start. That consists of a few db queries, and potential updates to the
> db + message on message queue. The majority of messages don't need
> processing. The messages needing processing at peak are about three every
> other second.
>
> One possible thing that might be happening is the session initialisation
> and prepared statement initialisation for each partition. I can resort to
> some tricks, but I think I'll try increasing batch interval to 15 seconds.
> I'll report back with findings.
>
> Thanks,
> Ashic.
>
> --
> From: gerard.m...@gmail.com
> Date: Thu, 22 Jan 2015 12:30:08 +0100
> Subject: Re: Are these numbers abnormal for spark streaming?
> To: tathagata.das1...@gmail.com
> CC: as...@live.com; t...@databricks.com; user@spark.apache.org
>
> and post the code (if possible).
> In a nutshell, your processing time > batch interval,  resulting in an
> ever-increasing delay that will end up in a crash.
> 3 secs to process 14 messages looks like a lot. Curious what the job logic
> is.
>
> -kr, Gerard.
>
> On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> This is not normal. It's a huge scheduling delay!! Can you tell me more
> about the application?
> - cluster setup, number of receivers, what's the computation, etc.
>
> On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:
>
> Hate to do this...but...erm...bump? Would really appreciate input from
> others using Streaming. Or at least some docs that would tell me if these
> are expected or not.
>
> --
> From: as...@live.com
> To: user@spark.apache.org
> Subject: Are these numbers abnormal for spark streaming?
> Date: Wed, 21 Jan 2015 11:26:31 +
>
>
> Hi Guys,
> I've got Spark Streaming set up for a low data rate system (using spark's
> features for analysis, rather than high throughput). Messages are coming in
> throughout the day, at around 1-20 per second (finger in the air
> estimate...not analysed yet).  In the spark streaming UI for the
> application, I'm getting the following after 17 hours.
>
> Streaming
>
>- *Started at: *Tue Jan 20 16:58:43 GMT 2015
>- *Time since start: *18 hours 24 minutes 34 seconds
>- *Network receivers: *2
>- *Batch interval: *2 seconds
>- *Processed batches: *16482

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Guys,
So I changed the interval to 15 seconds. There's obviously a lot more messages 
per batch, but (I think) it looks a lot healthier. Can you see any major 
warning signs? I think that with 2 second intervals, the setup / teardown per 
partition was what was causing the delays.

Streaming
- Started at: Thu Jan 22 13:23:12 GMT 2015
- Time since start: 1 hour 17 minutes 16 seconds
- Network receivers: 2
- Batch interval: 15 seconds
- Processed batches: 309
- Waiting batches: 0

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/22 14:40:29]; rates in records/sec)
- RmqReceiver-0: ACTIVE, VDCAPP53.foo.local, 2.6 K records in last batch, min rate 29, median rate 106, max rate 295, last error: -
- RmqReceiver-1: ACTIVE, VDCAPP50.bar.local, 2.6 K records in last batch, min rate 29, median rate 107, max rate 291, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 4 seconds 812 ms / 4 seconds 698 ms / 4 seconds 738 ms / 4 seconds 761 ms / 4 seconds 788 ms / 5 seconds 802 ms
- Scheduling Delay: 2 ms / 0 ms / 3 ms / 3 ms / 4 ms / 9 ms
- Total Delay: 4 seconds 814 ms / 4 seconds 701 ms / 4 seconds 739 ms / 4 seconds 764 ms / 4 seconds 792 ms / 5 seconds 809 ms
Regards,
Ashic.
From: as...@live.com
To: gerard.m...@gmail.com
CC: user@spark.apache.org
Subject: RE: Are these numbers abnormal for spark streaming?
Date: Thu, 22 Jan 2015 12:32:05 +




Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-memory filters to check if more 
processing is required. If so, only then does the "heavy" work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/21 11:23:18]; rates in records/sec)
- RmqReceiver-0: ACTIVE, F, 14 records in last batch, min rate 4, median rate 7, max rate 27, last error: -
- RmqReceiver-1: ACTIVE, BR, 12 records in last batch, min rate 4, median rate 7, max rate 26, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 3 seconds 994 ms / 157 ms / 4 seconds 16 ms / 4 seconds 961 ms / 5 seconds 3 ms / 5 seconds 171 ms
- Scheduling Delay: 9 hours 15 minutes 4 seconds / 9 hours 10 minutes 54 seconds / 9 hours 11 minutes 56 seconds / 9 hours 12 minutes 57 seconds / 9 hours 14 minutes 5 seconds / 9 hours 15 minutes 4 seconds
- Total Delay: 9 hours 15 minutes 8 seconds / 9 hours 10 minutes 58 seconds / 9 hours 12 minutes / 9 hours 13 minutes 2 seconds / 9 hours 14 minutes 10 seconds / 9 hours 15 minutes 8 seconds
Are these "normal". I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th 

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi Gerard,
Thanks for the response.

The messages get deserialised from msgpack format, and one of the strings is 
deserialised to JSON. Certain fields are checked to decide if further processing 
is required. If so, it goes through a series of in-memory filters to check if more 
processing is required. If so, only then does the "heavy" work start. That 
consists of a few db queries, and potential updates to the db + message on 
message queue. The majority of messages don't need processing. The messages 
needing processing at peak are about three every other second. 

One possible thing that might be happening is the session initialisation and 
prepared statement initialisation for each partition. I can resort to some 
tricks, but I think I'll try increasing batch interval to 15 seconds. I'll 
report back with findings.

Thanks,
Ashic.

From: gerard.m...@gmail.com
Date: Thu, 22 Jan 2015 12:30:08 +0100
Subject: Re: Are these numbers abnormal for spark streaming?
To: tathagata.das1...@gmail.com
CC: as...@live.com; t...@databricks.com; user@spark.apache.org

and post the code (if possible). In a nutshell, your processing time > batch 
interval, resulting in an ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic is.
-kr, Gerard.
On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/21 11:23:18]; rates in records/sec)
- RmqReceiver-0: ACTIVE, F, 14 records in last batch, min rate 4, median rate 7, max rate 27, last error: -
- RmqReceiver-1: ACTIVE, BR, 12 records in last batch, min rate 4, median rate 7, max rate 26, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 3 seconds 994 ms / 157 ms / 4 seconds 16 ms / 4 seconds 961 ms / 5 seconds 3 ms / 5 seconds 171 ms
- Scheduling Delay: 9 hours 15 minutes 4 seconds / 9 hours 10 minutes 54 seconds / 9 hours 11 minutes 56 seconds / 9 hours 12 minutes 57 seconds / 9 hours 14 minutes 5 seconds / 9 hours 15 minutes 4 seconds
- Total Delay: 9 hours 15 minutes 8 seconds / 9 hours 10 minutes 58 seconds / 9 hours 12 minutes / 9 hours 13 minutes 2 seconds / 9 hours 14 minutes 10 seconds / 9 hours 15 minutes 8 seconds
Are these "normal". I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th machine, and that machine has nothing but the driver 
running. The worker nodes are running alongside Cassandra (and reading and 
writing to it).

Any insights would be appreciated.

Regards,
Ashic.

  



  

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hi TD,
Here's some information:

1. Cluster has one standalone master, 4 workers. Workers are co-hosted with 
Apache Cassandra. Master is set up with external Zookeeper.
2. Each machine has 2 cores and 4 GB of RAM. This is for testing. All machines 
are VMware VMs. Spark has 2 GB dedicated to it on each node.
3. In addition to the streaming details, the master details as of now are given 
below. Only the streaming app is running.
4. I'm listening to two RabbitMQ queues using a RabbitMQ receiver (code: 
https://gist.github.com/ashic/b5edc7cfdc85aa60b066 ). Notifier code is here: 
https://gist.github.com/ashic/9abd352c691eafc8c9f3 
5. The receivers are initialised with the following code:
val ssc = new StreamingContext(sc, Seconds(2))
val messages1 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", "vdclog03", "abc_input", None))
val messages2 = ssc.receiverStream(new RmqReceiver("abc", "abc", "/", "vdclog04", "abc_input", None))
val messages = messages1.union(messages2)
val notifier = new RabbitMQEventNotifier("vdclog03", "abc", "abc_output_events", "abc", "abc", "/")

6. Usage:

  messages.map(x => ScalaMessagePack.read[RadioMessage](x))
    .flatMap(InputMessageParser.parse(_).getEvents())
    .foreachRDD(x => {
      x.foreachPartition(x => {
        cassandraConnector.withSessionDo(session => {
          val graphStorage = new CassandraGraphStorage(session)
          val notificationStorage = new CassandraNotificationStorage(session)
          val savingNotifier = new SavingNotifier(notifier, notificationStorage)

          x.foreach(eventWrapper => eventWrapper.event match {
            // do some queries
            // save some stuff if needed to cassandra
            // raise a message to a separate queue with a msg => Unit() operation
            // ... (cases elided)
          })
        })
      })
    })

7. The algorithm is simple: listen to messages from two separate RabbitMQ queues 
and union them. For each message, check the message properties. 
If needed, query Cassandra for additional details (a graph search, but done in 
0.5-3 seconds, and rare; it shouldn't overwhelm things with the low input rate).
If needed, save some info back into Cassandra (1-2 ms), and raise an event to 
the notifier.

I'm probably missing something basic, just wondering what. It has been running 
fine for about 42 hours now, but the numbers are a tad worrying.

Cheers,
Ashic.


Workers: 4
Cores: 8 Total, 4 Used
Memory: 8.0 GB Total, 2000.0 MB Used
Applications: 1 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Workers (Id / Address / State / Cores / Memory)
- worker-20141208131918-VDCAPP50.AAA.local-44476 / VDCAPP50.AAA.local:44476 / ALIVE / 2 (1 Used) / 2.0 GB (500.0 MB Used)
- worker-20141208132012-VDCAPP52.AAA.local-34349 / VDCAPP52.AAA.local:34349 / ALIVE / 2 (1 Used) / 2.0 GB (500.0 MB Used)
- worker-20141208132136-VDCAPP53.AAA.local-54000 / VDCAPP53.AAA.local:54000 / ALIVE / 2 (1 Used) / 2.0 GB (500.0 MB Used)
- worker-2014121627-VDCAPP49.AAA.local-57899 / VDCAPP49.AAA.local:57899 / ALIVE / 2 (1 Used) / 2.0 GB (500.0 MB Used)

Running Applications (ID / Name / Cores / Memory per Node / Submitted Time / User / State / Duration)
- app-20150120165844-0005 / App1 / 4 / 500.0 MB / 2015/01/20 16:58:44 / root / WAITING / 42.4 h

From: tathagata.das1...@gmail.com
Date: Thu, 22 Jan 2015 03:15:58 -0800
Subject: Re: Are these numbers abnormal for spark streaming?
To: as...@live.com; t...@databricks.com
CC: user@spark.apache.org

This is not normal. It's a huge scheduling delay!! Can you tell me more about 
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:



Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

StreamingStarted at: Tue Jan 20 16:58:43 GMT 2015Time since start: 18 hours 24 
minutes 34 secondsNetwork receivers: 2Batch interval: 2 secondsProcessed 
batches: 16482Waiting batches: 1

Statistics over last 100 processed batchesReceiver 
StatisticsReceiverStatusLocationRecords in last batch[2015/01/21 
11:23:18]Minimum rate[records/sec]Median rate[records/sec]Maximum 
rate[records/sec]Last ErrorRmqReceiver-0ACTIVEF
144727-RmqReceiver-1ACTIVEBR
124726-Batch Processing StatisticsMetricLast batchMinimum25th 
percentileMedian75th percentileMaximumProcessing Time3 seconds 994 ms157 ms4 
seconds 

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Gerard Maas
and post the code (if possible).
In a nutshell, your processing time > batch interval,  resulting in an
ever-increasing delay that will end up in a crash.
3 secs to process 14 messages looks like a lot. Curious what the job logic
is.

-kr, Gerard.

On Thu, Jan 22, 2015 at 12:15 PM, Tathagata Das  wrote:

> This is not normal. It's a huge scheduling delay!! Can you tell me more
> about the application?
> - cluster setup, number of receivers, what's the computation, etc.
>
> On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:
>
>> Hate to do this...but...erm...bump? Would really appreciate input from
>> others using Streaming. Or at least some docs that would tell me if these
>> are expected or not.
>>
>> --
>> From: as...@live.com
>> To: user@spark.apache.org
>> Subject: Are these numbers abnormal for spark streaming?
>> Date: Wed, 21 Jan 2015 11:26:31 +
>>
>>
>> Hi Guys,
>> I've got Spark Streaming set up for a low data rate system (using spark's
>> features for analysis, rather than high throughput). Messages are coming in
>> throughout the day, at around 1-20 per second (finger in the air
>> estimate...not analysed yet).  In the spark streaming UI for the
>> application, I'm getting the following after 17 hours.
>>
>> Streaming
>>
>>- *Started at: *Tue Jan 20 16:58:43 GMT 2015
>>- *Time since start: *18 hours 24 minutes 34 seconds
>>- *Network receivers: *2
>>- *Batch interval: *2 seconds
>>- *Processed batches: *16482
>>- *Waiting batches: *1
>>
>>
>>
>> Statistics over last 100 processed batchesReceiver Statistics
>>
>>- Receiver
>>
>>
>>- Status
>>
>>
>>- Location
>>
>>
>>- Records in last batch
>>- [2015/01/21 11:23:18]
>>
>>
>>- Minimum rate
>>- [records/sec]
>>
>>
>>- Median rate
>>- [records/sec]
>>
>>
>>- Maximum rate
>>- [records/sec]
>>
>>
>>- Last Error
>>
>> RmqReceiver-0ACTIVEF
>> 144727-RmqReceiver-1ACTIVEBR
>> 124726-
>> Batch Processing Statistics
>>
>>MetricLast batchMinimum25th percentileMedian75th 
>> percentileMaximumProcessing
>>Time3 seconds 994 ms157 ms4 seconds 16 ms4 seconds 961 ms5 seconds 3
>>ms5 seconds 171 msScheduling Delay9 hours 15 minutes 4 seconds9 hours
>>10 minutes 54 seconds9 hours 11 minutes 56 seconds9 hours 12 minutes
>>57 seconds9 hours 14 minutes 5 seconds9 hours 15 minutes 4 secondsTotal
>>Delay9 hours 15 minutes 8 seconds9 hours 10 minutes 58 seconds9 hours
>>12 minutes9 hours 13 minutes 2 seconds9 hours 14 minutes 10 seconds9
>>hours 15 minutes 8 seconds
>>
>>
>> Are these "normal". I was wondering what the scheduling delay and total
>> delay terms are, and if it's normal for them to be 9 hours.
>>
>> I've got a standalone spark master and 4 spark nodes. The streaming app
>> has been given 4 cores, and it's using 1 core per worker node. The
>> streaming app is submitted from a 5th machine, and that machine has nothing
>> but the driver running. The worker nodes are running alongside Cassandra
>> (and reading and writing to it).
>>
>> Any insights would be appreciated.
>>
>> Regards,
>> Ashic.
>>
>
>


Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Tathagata Das
This is not normal. It's a huge scheduling delay!! Can you tell me more
about the application?
- cluster setup, number of receivers, what's the computation, etc.

On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab  wrote:

> Hate to do this...but...erm...bump? Would really appreciate input from
> others using Streaming. Or at least some docs that would tell me if these
> are expected or not.
>
> --
> From: as...@live.com
> To: user@spark.apache.org
> Subject: Are these numbers abnormal for spark streaming?
> Date: Wed, 21 Jan 2015 11:26:31 +
>
>
> Hi Guys,
> I've got Spark Streaming set up for a low data rate system (using spark's
> features for analysis, rather than high throughput). Messages are coming in
> throughout the day, at around 1-20 per second (finger in the air
> estimate...not analysed yet).  In the spark streaming UI for the
> application, I'm getting the following after 17 hours.
>
> Streaming
>
>- *Started at: *Tue Jan 20 16:58:43 GMT 2015
>- *Time since start: *18 hours 24 minutes 34 seconds
>- *Network receivers: *2
>- *Batch interval: *2 seconds
>- *Processed batches: *16482
>- *Waiting batches: *1
>
>
>
> Statistics over last 100 processed batchesReceiver Statistics
>
>- Receiver
>
>
>- Status
>
>
>- Location
>
>
>- Records in last batch
>- [2015/01/21 11:23:18]
>
>
>- Minimum rate
>- [records/sec]
>
>
>- Median rate
>- [records/sec]
>
>
>- Maximum rate
>- [records/sec]
>
>
>- Last Error
>
> RmqReceiver-0ACTIVEF
> 144727-RmqReceiver-1ACTIVEBR
> 124726-
> Batch Processing Statistics
>
>MetricLast batchMinimum25th percentileMedian75th 
> percentileMaximumProcessing
>Time3 seconds 994 ms157 ms4 seconds 16 ms4 seconds 961 ms5 seconds 3 ms5
>seconds 171 msScheduling Delay9 hours 15 minutes 4 seconds9 hours 10
>minutes 54 seconds9 hours 11 minutes 56 seconds9 hours 12 minutes 57
>seconds9 hours 14 minutes 5 seconds9 hours 15 minutes 4 secondsTotal
>Delay9 hours 15 minutes 8 seconds9 hours 10 minutes 58 seconds9 hours
>12 minutes9 hours 13 minutes 2 seconds9 hours 14 minutes 10 seconds9
>hours 15 minutes 8 seconds
>
>
> Are these "normal". I was wondering what the scheduling delay and total
> delay terms are, and if it's normal for them to be 9 hours.
>
> I've got a standalone spark master and 4 spark nodes. The streaming app
> has been given 4 cores, and it's using 1 core per worker node. The
> streaming app is submitted from a 5th machine, and that machine has nothing
> but the driver running. The worker nodes are running alongside Cassandra
> (and reading and writing to it).
>
> Any insights would be appreciated.
>
> Regards,
> Ashic.
>


RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hate to do this...but...erm...bump? Would really appreciate input from others 
using Streaming. Or at least some docs that would tell me if these are expected 
or not.

From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015 11:26:31 +




Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's 
features for analysis, rather than high throughput). Messages are coming in 
throughout the day, at around 1-20 per second (finger in the air estimate...not 
analysed yet).  In the spark streaming UI for the application, I'm getting the 
following after 17 hours.

Streaming
- Started at: Tue Jan 20 16:58:43 GMT 2015
- Time since start: 18 hours 24 minutes 34 seconds
- Network receivers: 2
- Batch interval: 2 seconds
- Processed batches: 16482
- Waiting batches: 1

Statistics over last 100 processed batches

Receiver Statistics (records in last batch as of [2015/01/21 11:23:18]; rates in records/sec)
- RmqReceiver-0: ACTIVE, F, 14 records in last batch, min rate 4, median rate 7, max rate 27, last error: -
- RmqReceiver-1: ACTIVE, BR, 12 records in last batch, min rate 4, median rate 7, max rate 26, last error: -

Batch Processing Statistics (last batch / minimum / 25th percentile / median / 75th percentile / maximum)
- Processing Time: 3 seconds 994 ms / 157 ms / 4 seconds 16 ms / 4 seconds 961 ms / 5 seconds 3 ms / 5 seconds 171 ms
- Scheduling Delay: 9 hours 15 minutes 4 seconds / 9 hours 10 minutes 54 seconds / 9 hours 11 minutes 56 seconds / 9 hours 12 minutes 57 seconds / 9 hours 14 minutes 5 seconds / 9 hours 15 minutes 4 seconds
- Total Delay: 9 hours 15 minutes 8 seconds / 9 hours 10 minutes 58 seconds / 9 hours 12 minutes / 9 hours 13 minutes 2 seconds / 9 hours 14 minutes 10 seconds / 9 hours 15 minutes 8 seconds
Are these "normal". I was wondering what the scheduling delay and total delay 
terms are, and if it's normal for them to be 9 hours.

I've got a standalone spark master and 4 spark nodes. The streaming app has 
been given 4 cores, and it's using 1 core per worker node. The streaming app is 
submitted from a 5th machine, and that machine has nothing but the driver 
running. The worker nodes are running alongside Cassandra (and reading and 
writing to it).

Any insights would be appreciated.

Regards,
Ashic.