Re: throughput in the web console?
Sorry, I made a mistake. Please ignore my question.

On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa wrote:
> I performed repartitioning and everything went fine with respect to the
> number of CPU cores being used (and the respective times). However, I
> noticed something very strange: inside a map operation I was doing a very
> simple calculation, always on the same dataset (small enough to be
> entirely processed in the same batch); then I iterated over the RDDs and
> calculated the mean, "foreachRDD(rdd => println("MEAN: " + rdd.mean()))".
> I noticed that for different numbers of partitions (for instance, 4 and
> 8), the result of the mean is different. Why does this happen?
Re: throughput in the web console?
I performed repartitioning and everything went fine with respect to the
number of CPU cores being used (and the respective times). However, I
noticed something very strange: inside a map operation I was doing a very
simple calculation, always on the same dataset (small enough to be entirely
processed in the same batch); then I iterated over the RDDs and calculated
the mean, "foreachRDD(rdd => println("MEAN: " + rdd.mean()))". I noticed
that for different numbers of partitions (for instance, 4 and 8), the
result of the mean is different. Why does this happen?

On Thu, Feb 26, 2015 at 7:03 PM, Tathagata Das wrote:
> If you have one receiver, and you are doing only map-like operations, then
> the processing will primarily happen on one machine. To use all the
> machines, either receive in parallel with multiple receivers, or spread
> out the computation by explicitly repartitioning the received streams
> (DStream.repartition) with sufficient partitions to load-balance across
> more machines.
>
> TD
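One plausible cause of the partition-dependent mean (an assumption on my part; the thread itself never answers the question): floating-point addition is not associative, so per-partition partial sums combined together need not equal a single left-to-right sum over the same values. A minimal pure-Scala illustration, with no Spark involved:

```scala
// Hypothetical illustration (not from the thread): the values are chosen
// so that rounding makes a sequential sum and a "partitioned" sum differ.
object MeanDemo {
  def main(args: Array[String]): Unit = {
    val xs = Array(1e16, 1.0, -1e16, 1.0)

    // "One partition": a single sequential left-to-right sum.
    val seqMean = xs.foldLeft(0.0)(_ + _) / xs.length

    // "Two partitions": sum each half, then combine the partial sums.
    val partMean = (xs.take(2).sum + xs.drop(2).sum) / xs.length

    // 1e16 + 1.0 rounds back to 1e16 in double precision, so the two
    // groupings lose the small terms differently.
    println(s"sequential mean = $seqMean, partitioned mean = $partMean")
  }
}
```

The discrepancy is usually tiny for well-scaled data; values with wildly different magnitudes, as above, make it visible.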
Re: throughput in the web console?
If you have one receiver, and you are doing only map-like operations, then
the processing will primarily happen on one machine. To use all the
machines, either receive in parallel with multiple receivers, or spread out
the computation by explicitly repartitioning the received streams
(DStream.repartition) with sufficient partitions to load-balance across
more machines.

TD

On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa wrote:
> One more question: while processing the exact same batch, I noticed that
> giving more CPUs to the worker does not decrease the duration of the
> batch. I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU
> the duration increased, but apart from that the values were pretty
> similar, whether I was using 4, 6, or 8 CPUs.
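TD's two suggestions can be sketched roughly as follows. This is a hypothetical example, not from the thread: the socket host/port, the three receivers, and the eight partitions are made up, and the calls follow the Spark Streaming 1.x Scala API:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParallelReceiveSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("sketch"), Seconds(1))

    // Option 1: receive in parallel with multiple receivers, then union
    // the resulting streams into one DStream.
    val streams = (1 to 3).map(_ => ssc.socketTextStream("host", 9999))
    val unioned = ssc.union(streams)

    // Option 2: keep one receiver, but repartition the received stream so
    // the downstream map-like work spreads across the cluster.
    val single = ssc.socketTextStream("host", 9999)
    val spread = single.repartition(8) // enough partitions for all cores

    spread.map(_.length).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Option 1 parallelizes ingestion itself; option 2 only parallelizes the processing after a single receiver, which matches the single-receiver scenario TD describes.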
Re: throughput in the web console?
One more question: while processing the exact same batch, I noticed that
giving more CPUs to the worker does not decrease the duration of the batch.
I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU the
duration increased, but apart from that the values were pretty similar,
whether I was using 4, 6, or 8 CPUs.

On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa wrote:
> By setting spark.eventLog.enabled to true it is possible to see the
> application UI after the application has finished its execution; however,
> the Streaming tab is no longer visible.
>
> For measuring the duration of batches in the code I am doing something
> like this:
>
> wordCharValues.foreachRDD(rdd => {
>   val startTick = System.currentTimeMillis()
>   val result = rdd.take(1)
>   val timeDiff = System.currentTimeMillis() - startTick
>   ...
> })
>
> But my question is: is it possible to see the rate/throughput
> (records/sec) when I have a stream that processes log files appearing in
> a folder?
Re: throughput in the web console?
By setting spark.eventLog.enabled to true it is possible to see the
application UI after the application has finished its execution; however,
the Streaming tab is no longer visible.

For measuring the duration of batches in the code I am doing something like
this:

wordCharValues.foreachRDD(rdd => {
  val startTick = System.currentTimeMillis()
  val result = rdd.take(1)
  val timeDiff = System.currentTimeMillis() - startTick
  ...
})

But my question is: is it possible to see the rate/throughput (records/sec)
when I have a stream that processes log files appearing in a folder?

On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das wrote:
> Yes. # tuples processed in a batch = sum of all the tuples received by all
> the receivers.
>
> In the screenshot, there was a batch with 69.9K records, and there was a
> batch which took 1 s 473 ms. These two batches may or may not be the same
> batch.
>
> TD
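The timing snippet above could be extended into a rough per-batch throughput log when the Streaming tab is unavailable. This is a sketch, not from the thread; `wordCharValues` is the poster's stream, and note that `rdd.count()` forces a full pass over the batch (unlike `take(1)`, which only materializes the first partition far enough to find one element):

```scala
// Rough records/sec per batch, measured by hand inside foreachRDD.
// Caveat: the count itself is part of the measured work.
wordCharValues.foreachRDD { rdd =>
  val startTick = System.currentTimeMillis()
  val numRecords = rdd.count() // counts the whole batch
  val elapsedMs = System.currentTimeMillis() - startTick
  val recordsPerSec =
    if (elapsedMs > 0) numRecords * 1000.0 / elapsedMs else 0.0
  println(s"batch: $numRecords records in $elapsedMs ms " +
    f"($recordsPerSec%.1f records/sec)")
}
```

This measures only the count job's duration, not the batch's full processing time, so treat it as an approximation.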
Re: throughput in the web console?
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.

In the screenshot, there was a batch with 69.9K records, and there was a
batch which took 1 s 473 ms. These two batches may or may not be the same
batch.

TD

On Wed, Feb 25, 2015 at 10:11 AM, Josh J wrote:
> If I'm using the Kafka receiver, can I assume the number of records
> processed in the batch is the sum of the number of records processed by
> the Kafka receiver?
>
> So in the screenshot attached, the max rate of tuples processed in a
> batch is 42.7K + 27.2K = 69.9K tuples processed in a batch, with a max
> processing time of 1 second 473 ms?
Re: throughput in the web console?
Hi Josh,

SPM will show you this info. I see you use Kafka, too, whose numerous
metrics you can also see in SPM side by side with your Spark metrics.
Sounds like trends are what you are after, so I hope this helps.

See http://sematext.com/spm

Otis

> On Feb 24, 2015, at 11:59, Josh J wrote:
>
> Hi,
>
> I plan to run a parameter search varying the number of cores, epoch, and
> parallelism. The web console provides a way to archive the previous runs,
> but is there a way to view the throughput in the console, rather than
> logging the throughput separately to the log files and correlating the
> log files to the web console processing times?
>
> Thanks,
> Josh

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: throughput in the web console?
By throughput you mean the number of events processed, etc.?

[image: Inline image 1]

The Streaming tab already has these statistics.

Thanks
Best Regards

On Wed, Feb 25, 2015 at 9:59 PM, Josh J wrote:
> On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das wrote:
>> For Spark Streaming applications, there is already a tab called
>> "Streaming" which displays the basic statistics.
>
> Would I just need to extend this tab to add the throughput?
Re: throughput in the web console?
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das wrote:
> For Spark Streaming applications, there is already a tab called
> "Streaming" which displays the basic statistics.

Would I just need to extend this tab to add the throughput?
Re: throughput in the web console?
For Spark Streaming applications, there is already a tab called "Streaming"
which displays the basic statistics.

Thanks
Best Regards

On Wed, Feb 25, 2015 at 8:55 PM, Josh J wrote:
> Let me ask like this: what would be the easiest way to display the
> throughput in the web console? Would I need to create a new tab and add
> the metrics? Any good or simple examples showing how this can be done?
Re: throughput in the web console?
Let me ask like this: what would be the easiest way to display the
throughput in the web console? Would I need to create a new tab and add the
metrics? Any good or simple examples showing how this can be done?

On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das wrote:
> Did you have a look at
>
> https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener
>
> and, for Streaming:
>
> https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
>
> Thanks
> Best Regards
Re: throughput in the web console?
Did you have a look at

https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener

and, for Streaming:

https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener

Thanks
Best Regards

On Tue, Feb 24, 2015 at 10:29 PM, Josh J wrote:
> Hi,
>
> I plan to run a parameter search varying the number of cores, epoch, and
> parallelism. The web console provides a way to archive the previous runs,
> but is there a way to view the throughput in the console, rather than
> logging the throughput separately to the log files and correlating the
> log files to the web console processing times?
>
> Thanks,
> Josh
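A StreamingListener along the lines Akhil points to can log per-batch throughput without touching the web UI. The sketch below is an illustration under stated assumptions, not a confirmed recipe from the thread: it targets later Spark 1.x, where `StreamingListenerBatchCompleted.batchInfo` exposes `numRecords` and `processingDelay` (the `BatchInfo` fields differ in older releases, including the 1.0.2 docs linked above):

```scala
import org.apache.spark.streaming.scheduler.{
  StreamingListener, StreamingListenerBatchCompleted}

// Logs records/sec for every completed batch. Register it with
// ssc.addStreamingListener(new ThroughputListener()) before ssc.start().
class ThroughputListener extends StreamingListener {
  override def onBatchCompleted(
      batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // processingDelay is an Option[Long] in milliseconds; skip batches
    // without a recorded processing time.
    for (procMs <- info.processingDelay if procMs > 0) {
      val recsPerSec = info.numRecords * 1000.0 / procMs
      println(f"batch ${info.batchTime}: ${info.numRecords} records " +
        f"in $procMs ms ($recsPerSec%.1f records/sec)")
    }
  }
}
```

Compared with timing inside foreachRDD, a listener sees the scheduler's own per-batch numbers and adds no work to the batch itself.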