Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
Well, I have already tried that.
You are talking about a command similar to this, right? yarn logs
-applicationId application_Number
This gives me the processing logs, which contain information about the
tasks, RDD blocks, etc.

What I really need is the output log that gets generated as part of the
Spark job itself: the job writes some output to a file whose path is specified
in the job code. This file currently resides within the container's appcache;
is there a way I can retrieve it once the job is over?
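
(One workaround I have been considering, only a rough sketch that I have not
verified on this cluster: instead of having each executor write to a local file
under its container's appcache, save the results directly to HDFS from the job,
so they are still there after the application finishes. This uses the `output`
RDD from the code quoted below; the output path is just a placeholder.)

 // Sketch only: format each result record as a line of text and let Spark
 // write the whole RDD to HDFS (one part-file per partition).
 // Needs org.apache.spark.api.java.JavaRDD and
 // org.apache.spark.api.java.function.Function.
 JavaRDD<String> outputLines = output.map(new Function<Object[], String>() {
     private static final long serialVersionUID = 1L;

     @Override
     public String call(Object[] record) throws Exception {
         return java.util.Arrays.toString(record); // placeholder formatting
     }
 });
 outputLines.saveAsTextFile("hdfs://ip:8020/user/cdhuser/outputFolder/"); // placeholder path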



On Wed, Sep 21, 2016 at 4:00 PM, ayan guha  wrote:

> On YARN, logs are aggregated from each container to HDFS. You can use the
> YARN CLI or UI to view them. For Spark, you would have a history server which
> consolidates the logs.
> On 21 Sep 2016 19:03, "Nisha Menon"  wrote:
>
>> I looked at the driver logs, and that reminded me that I needed to look at
>> the executor logs. There the issue was that the Spark executors were not
>> getting a configuration file; I broadcast the file and now the processing
>> happens. Thanks for the suggestion.
>> Currently my issue is that the log file generated independently by the
>> executors goes to the respective containers' appcache and then gets lost.
>> Is there a recommended way to get the output files from the individual
>> executors?
>>
>> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal 
>> wrote:
>>
>>> Are you looking at the worker logs or the driver?
>>>
>>>
>>> On Thursday, September 8, 2016, Nisha Menon 
>>> wrote:
>>>
 I have an RDD created as follows:

 JavaPairRDD<String, String> inputDataFiles =
     sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

 On this RDD I perform a map to process individual files, and then invoke a
 foreach to trigger that map (since map on its own is lazy).

 // (Function/VoidFunction are from org.apache.spark.api.java.function, Tuple2 from scala.)
 JavaRDD<Object[]> output = inputDataFiles.map(
     new Function<Tuple2<String, String>, Object[]>() {

         private static final long serialVersionUID = 1L;

         @Override
         public Object[] call(Tuple2<String, String> v1) throws Exception {
             System.out.println("in map!");
             // do something with v1 (v1._1() is the file path, v1._2() the file content)
             return new Object[0]; // placeholder for the real result
         }
 });

 output.foreach(new VoidFunction<Object[]>() {

     private static final long serialVersionUID = 1L;

     @Override
     public void call(Object[] t) throws Exception {
         // do nothing, just force evaluation
         System.out.println("in foreach!");
     }
 });

 This code works perfectly fine in a standalone setup on my local laptop,
 accessing both local files and remote HDFS files.

 On the cluster, the same code produces no results. My intuition is that the
 data has not reached the individual executors, and hence neither the `map` nor
 the `foreach` does any work. That is only a guess, but I am not able to figure
 out why this would not work on the cluster. I don't even see the print
 statements in `map` and `foreach` getting printed in cluster mode of
 execution.

 I notice a particular line in the standalone output that I do NOT see in
 the cluster execution.

 *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
 Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*

 I had similar code with textFile() that worked earlier for individual
 files on the cluster. The issue is with wholeTextFiles() only.
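
 (For reference, the difference between the two as I understand it, shown as a
 minimal sketch with the same placeholder folder:)

 // textFile(): one String record per line, pooled across all files in the folder.
 JavaRDD<String> lines =
     sparkContext.textFile("hdfs://ip:8020/user/cdhuser/inputFolder/");

 // wholeTextFiles(): one (filePath, fileContent) pair per file.
 JavaPairRDD<String, String> files =
     sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");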

 Please advise on the best way to get this working, or suggest alternative
 approaches.

 My setup is the Cloudera 5.7 distribution with the Spark service. I used
 `yarn-client` as the master.

 The action can be anything; it's just a dummy step to invoke the map. I
 also tried System.out.println("Count is:" + output.count());, for which
 I got the correct answer of 10, since there were 10 files in the
 folder, but still the map does not seem to do anything.
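
 (A small check I could add, sketch only: since count() does run the job,
 pulling a couple of records back to the driver and printing them there should
 show whether the map output itself is fine. As far as I understand, a
 System.out.println inside map/foreach goes to each executor's stdout log on
 the cluster, not to the driver console.)

 // Sketch: take() is an action, so it forces the map to run; the println below
 // executes on the driver, so it appears in the driver/client console.
 java.util.List<Object[]> sample = output.take(2);
 for (Object[] record : sample) {
     System.out.println("sample record: " + java.util.Arrays.toString(record));
 }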

 Thanks.


>>>
>>> --
>>> Thanks,
>>> Sonal
>>> Nube Technologies 
>>>
>>
>>
>> --
>> Nisha Menon
>> BTech (CS) Sahrdaya CET,
>> MTech (CS) IIIT Bangalore.
>>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Bangalore.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread ayan guha
On YARN, logs are aggregated from each container to HDFS. You can use the YARN
CLI or UI to view them. For Spark, you would have a history server which
consolidates the logs.
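
(Roughly what that involves on the Spark side, as a sketch; the log directory
and host below are placeholders and depend on how the cluster is set up:)

# spark-defaults.conf on the client (sketch, placeholder paths)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark/applicationHistory
spark.yarn.historyServer.address historyhost:18080

# and on the host running the Spark history server
spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory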


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
Are you looking at the worker logs or the driver?


-- 
Thanks,
Sonal
Nube Technologies