Re: Looking at EMR Logs
Thanks. That seems to work great, except EMR doesn't always copy the logs to S3. The behavior seems inconsistent, and I am debugging it now.

--
Paul Henry Tremblay
Robert Half Technology
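For anyone hitting the same inconsistency, a quick sanity check (a sketch, assuming the placeholder `s3://bucket/some/directory` from the reply below and the AWS CLI installed on the master node):

```
# List whatever event logs finished applications actually wrote.
# Each file is named after the Spark application ID (e.g.
# application_1490000000000_0001 on YARN); an .inprogress suffix
# means the job never cleanly closed its event log.
aws s3 ls s3://bucket/some/directory/

# Confirm the setting the job actually ran with, in case a
# cluster-level default overrode the one passed to the job.
grep eventLog /etc/spark/conf/spark-defaults.conf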
Re: Looking at EMR Logs
If you modify spark.eventLog.dir to point to an S3 path, you will encounter the following exception in the Spark history server log at /var/log/spark/spark-history-server.out:

```
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
```

To move past this issue, we can do the following (this is for EMR release emr-5.4.0):

```
cd /usr/lib/spark/jars
sudo ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.15.0.jar emrfs.jar
```

Now the Spark history server will start up correctly, and you can review the Spark event logs on S3.
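In full, the workaround on the master node might look like the sketch below. Assumptions: the emrfs assembly version varies by EMR release (hence the glob rather than the hard-coded 2.15.0), `spark.history.fs.logDirectory` is the standard property for pointing the history server at an event-log directory, and on EMR the history server may be managed as a system service rather than via the sbin scripts shown here.

```
# Link the EMRFS filesystem implementation into Spark's jars directory
# so the history server can resolve the EmrFileSystem class for s3:// paths.
cd /usr/lib/spark/jars
sudo ln -s /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-*.jar emrfs.jar

# Point the history server at the S3 event-log directory
# (placeholder bucket/prefix from this thread).
echo "spark.history.fs.logDirectory s3://bucket/some/directory" \
  | sudo tee -a /etc/spark/conf/spark-defaults.conf

# Restart the history server (assumed restart path; EMR may manage
# the history server as a service instead).
sudo /usr/lib/spark/sbin/stop-history-server.sh
sudo /usr/lib/spark/sbin/start-history-server.sh
```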
Re: Looking at EMR Logs
You can provide your own log directory, where the Spark event log will be saved, and which you can replay afterwards.

Set this in your job: `spark.eventLog.dir=s3://bucket/some/directory` and run it.
Note! The path `s3://bucket/some/directory` must exist before you run your job; it will not be created automatically.

The Spark HistoryServer on EMR won't show you anything, because it looks for logs in `hdfs:///var/log/spark/apps` by default.

After that you can either copy the log files from S3 to the HDFS path above, or copy them locally to `/tmp/spark-events` (the default directory for Spark logs) and run the history server like:

```
cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
sbin/start-history-server.sh
```

and then open http://localhost:18080
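To make that concrete, a minimal end-to-end sketch (the job script name `my_job.py` and the bucket path are placeholders):

```
# Enable event logging and point it at the (pre-created) S3 directory.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3://bucket/some/directory \
  my_job.py

# Afterwards, pull the event logs down to the history server's
# default local directory and start it.
mkdir -p /tmp/spark-events
aws s3 cp --recursive s3://bucket/some/directory/ /tmp/spark-events/
cd /usr/local/src/spark-1.6.1-bin-hadoop2.6
sbin/start-history-server.sh
```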
Looking at EMR Logs
I am looking for tips on evaluating my Spark job after it has run.

I know that right now I can look at the history of jobs through the web UI. I also know how to look at the current resources being used by a similar web UI.

However, I would like to look at the logs after the job is finished to evaluate such things as how many tasks were completed, how many executors were used, etc. I currently save my logs to S3.

Thanks!

Henry

--
Paul Henry Tremblay
Robert Half Technology