On YARN, logs are aggregated from each container to HDFS. You can use the YARN CLI or UI to view them. For Spark, you would have a history server which consolidates the logs.
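For example, assuming log aggregation is enabled and the application has finished, the aggregated container logs can typically be pulled with the yarn CLI (the application id below is a placeholder):

    yarn logs -applicationId <application_id>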
On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote:

> I looked at the driver logs, and that reminded me that I needed to look at
> the executor logs. There the issue was that the Spark executors were not
> getting a configuration file. I broadcast the file and now the processing
> happens. Thanks for the suggestion.
> Currently my issue is that the log files generated independently by the
> executors go to the respective containers' appcache and are then lost. Is
> there a recommended way to get the output files from the individual
> executors?
>
> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>
>> Are you looking at the worker logs or the driver?
>>
>> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
>> wrote:
>>
>>> I have an RDD created as follows:
>>>
>>>     JavaPairRDD<String, String> inputDataFiles =
>>>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>>>
>>> On this RDD I perform a map to process individual files and invoke a
>>> foreach to trigger the same map.
>>>
>>>     JavaRDD<Object[]> output = inputDataFiles.map(
>>>         new Function<Tuple2<String, String>, Object[]>() {
>>>
>>>             private static final long serialVersionUID = 1L;
>>>
>>>             @Override
>>>             public Object[] call(Tuple2<String, String> v1) throws Exception {
>>>                 System.out.println("in map!");
>>>                 // do something with v1.
>>>                 return new Object[] { /* results for this file */ };
>>>             }
>>>         });
>>>
>>>     output.foreach(new VoidFunction<Object[]>() {
>>>
>>>         private static final long serialVersionUID = 1L;
>>>
>>>         @Override
>>>         public void call(Object[] t) throws Exception {
>>>             // do nothing!
>>>             System.out.println("in foreach!");
>>>         }
>>>     });
>>>
>>> This code works perfectly fine in standalone mode on my local laptop,
>>> accessing both local files and remote HDFS files.
>>>
>>> On the cluster the same code produces no results. My intuition is that
>>> the data has not reached the individual executors and hence neither the
>>> `map` nor the `foreach` works, but that is only a guess. I am not able to
>>> figure out why this would not work on the cluster. I don't even see the
>>> print statements in `map` and `foreach` getting printed in cluster mode.
>>>
>>> I notice a particular line in the standalone output that I do NOT see in
>>> the cluster execution:
>>>
>>>     16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345
>>>
>>> I had similar code with textFile() that worked earlier for individual
>>> files on the cluster. The issue is with wholeTextFiles() only.
>>>
>>> Please advise on the best way to get this working, or suggest alternative
>>> approaches.
>>>
>>> My setup is the Cloudera 5.7 distribution with the Spark service. I used
>>> `yarn-client` as the master.
>>>
>>> The action can be anything; it is just a dummy step to invoke the map. I
>>> also tried System.out.println("Count is:"+output.count());, for which I
>>> got the correct answer of `10`, since there were 10 files in the folder,
>>> but still the map refuses to work.
>>>
>>> Thanks.
>>>
>>
>> --
>> Thanks,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>
> --
> Nisha Menon
> BTech (CS) Sahrdaya CET,
> MTech (CS) IIIT Bangalore.
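For reference, the "broadcast the file" step mentioned earlier in this thread could look roughly like the sketch below, reusing sparkContext and inputDataFiles from the code above. The local path and the names confText, confBroadcast, and conf are illustrative assumptions, not the exact code used here: the configuration file is read once on the driver and its contents are shipped to every executor as a broadcast variable.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    // Read the configuration file once on the driver (hypothetical local path).
    String confText = new String(java.nio.file.Files.readAllBytes(
            java.nio.file.Paths.get("/path/to/app.conf")));

    // Ship the contents to the executors as a broadcast variable.
    final Broadcast<String> confBroadcast = sparkContext.broadcast(confText);

    JavaRDD<Object[]> output = inputDataFiles.map(
        new Function<Tuple2<String, String>, Object[]>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String, String> v1) throws Exception {
                // The broadcast value is available locally on every executor.
                String conf = confBroadcast.value();
                // ... use conf to process the file contents in v1._2() ...
                return new Object[] { v1._1() };
            }
        });

An alternative with a similar effect is to call sparkContext.addFile(...) on the driver and read the file on the executors via SparkFiles.get(...), which distributes the file itself rather than its contents.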