Re: Spark on EMR: out-of-the-box solution for real-time application logs monitoring?

2015-12-11 Thread Roberto Coluccio
Thanks for your advice, Steve.

I'm mainly talking about application logs. To be clearer, think for
instance of
"//hadoop/userlogs/application_blablabla/container_blablabla/stderr_or_stdout",
i.e. the YARN application container logs, which (at least on EMR's Hadoop
2.4) are stored on the DataNodes and aggregated/pushed only once the
application completes.

"yarn logs" issued from the cluster Master doesn't allow you to on-demand
aggregate logs for applications the are in running/active state.
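
For reference, the command I mean is something like

    yarn logs -applicationId application_1449759392885_0001

(the application id here is made up); on Hadoop 2.4 it simply reports that
the logs aren't available until the application finishes and aggregation
kicks in.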

For now I managed to install the awslogs agent (
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CWL_GettingStarted.html)
on the DataNodes so that it pushes container logs to CloudWatch Logs in
real time, but that's kind of a workaround too. This is why I was wondering
what the community (in general, not only on EMR) uses to monitor
application logs in real time (in an automated fashion) for long-running
processes like streaming drivers, and whether there are any out-of-the-box
solutions.
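
In case it's useful to others, my agent configuration looks roughly like
the following; the group/stream names are just mine, and the userlogs path
may differ depending on the AMI, so treat it as a sketch:

    [general]
    state_file = /var/lib/awslogs/agent-state

    [yarn-container-logs]
    file = /mnt/var/log/hadoop/userlogs/application_*/container_*/*
    log_group_name = emr-yarn-container-logs
    log_stream_name = {hostname}
    initial_position = start_of_file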

Thanks,

Roberto





On Thu, Dec 10, 2015 at 3:06 PM, Steve Loughran wrote:

>
> > On 10 Dec 2015, at 14:52, Roberto Coluccio wrote:
> >
> > Hello,
> >
> > I'm investigating a solution to monitor, in real time, the Spark logs
> > produced by my EMR cluster, in order to collect statistics and trigger
> > alarms. Being on EMR, I found CloudWatch Logs + Lambda pretty
> > straightforward and, since I'm on AWS, those services are well
> > integrated with each other... but I could only find examples of using
> > them on standalone EC2 instances.
> >
> > In my use case (EMR 3.9 and Spark 1.4.1 drivers running on YARN in
> > cluster mode), I would like to monitor the Spark logs in real time, not
> > just once the processing ends and they are copied to S3. Is there any
> > out-of-the-box solution or best practice to accomplish this when
> > running on EMR that I'm not aware of?
> >
> > Spark logs are written to the Data Nodes' (Core Instances') local file
> > systems as YARN container logs, so installing the awslogs agent on them
> > and pointing it at those logfiles would probably help push such logs to
> > CloudWatch, but I was wondering how the community monitors application
> > logs in real time when running Spark on YARN on EMR.
> >
> > Or maybe I'm looking at the wrong solution. Maybe the correct way
> > would be something like a CloudwatchSink, so that Spark (log4j) pushes
> > logs directly to the sink and the sink pushes them to CloudWatch (I do
> > like the out-of-the-box EMR logging experience, and I want to keep the
> > usual archiving of logs to S3 when the EMR cluster is terminated).
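> >
> > Something along these lines is what I have in mind: a totally untested
> > sketch of a log4j appender built on the AWS SDK's CloudWatch Logs
> > client, with made-up group/stream names (a real appender would batch
> > events instead of calling putLogEvents once per event):
> >
> >   import com.amazonaws.services.logs.AWSLogsClient
> >   import com.amazonaws.services.logs.model.{InputLogEvent, PutLogEventsRequest}
> >   import org.apache.log4j.AppenderSkeleton
> >   import org.apache.log4j.spi.LoggingEvent
> >   import scala.collection.JavaConverters._
> >
> >   // Untested sketch: forwards each log4j event to CloudWatch Logs.
> >   // Assumes the log group and stream already exist.
> >   class CloudWatchAppender extends AppenderSkeleton {
> >     private val client = new AWSLogsClient() // credentials from instance role
> >     private var sequenceToken: String = null
> >
> >     override def append(event: LoggingEvent): Unit = {
> >       val logEvent = new InputLogEvent()
> >         .withTimestamp(event.getTimeStamp)
> >         .withMessage(layout.format(event))
> >       val request = new PutLogEventsRequest(
> >         "my-emr-app-logs", "driver", List(logEvent).asJava) // made-up names
> >         .withSequenceToken(sequenceToken)
> >       sequenceToken = client.putLogEvents(request).getNextSequenceToken
> >     }
> >
> >     override def close(): Unit = client.shutdown()
> >     override def requiresLayout(): Boolean = true
> >   }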
> >
> > Any ideas or experience about this problem?
> >
> > Thank you.
> >
> > Roberto
>
>
> Are you talking about event logs as used by the history server, or
> application logs?
>
> The current Spark log server writes events to a file, but as the Hadoop
> S3 FS client doesn't write except in close(), they won't be pushed out
> while things are running. Someone (you?) could have a go at implementing
> a new event listener; some stuff that will come out in Spark 2.0 will
> make it easier to wire this up (SPARK-11314), which is coming as part of
> some work on Spark-YARN timelineserver integration.
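>
> As a rough illustration (names made up, and "publish" standing in for
> whatever transport you'd use), the kind of listener I mean looks like:
>
>   import org.apache.spark.scheduler._
>
>   // Forwards a couple of scheduler events somewhere external.
>   class ForwardingListener(publish: String => Unit) extends SparkListener {
>     override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
>       publish(s"stage ${stage.stageInfo.stageId} completed")
>
>     override def onTaskEnd(task: SparkListenerTaskEnd): Unit =
>       publish(s"task ${task.taskInfo.taskId} ended")
>   }
>
>   // on the driver:
>   // sc.addSparkListener(new ForwardingListener(msg => println(msg)))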
>
> In Hadoop 2.7.1, the log4j logs can be captured regularly by the YARN
> NodeManagers and automatically copied out; look at
> yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds. For
> that to work you need to set up your log wildcard patterns for the NM to
> locate them (i.e. have rolling logs with the right extensions)... the
> details escape me right now.
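>
> From memory, the setup is roughly the following; treat it as a sketch
> (the interval and file names here are arbitrary):
>
>   <!-- yarn-site.xml: ship rolled-over container logs every 5 minutes -->
>   <property>
>     <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
>     <value>300</value>
>   </property>
>
> plus a rolling appender in the container-side log4j.properties, so there
> are closed files for the NM to pick up:
>
>   log4j.rootCategory=INFO, rolling
>   log4j.appender.rolling=org.apache.log4j.RollingFileAppender
>   log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
>   log4j.appender.rolling.MaxFileSize=50MB
>   log4j.appender.rolling.MaxBackupIndex=5
>   log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
>   log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n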
>
> In earlier versions, you can use "yarn logs" to grab them and pull them
> down.
>
> I don't know anything about CloudWatch integration, sorry.

