How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Nisha Menon
I have an RDD created as follows:

    JavaPairRDD<String,String> inputDataFiles =
        sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");

On this RDD I perform a map to process the individual files, then invoke a
foreach to trigger that map.

    JavaRDD<Object[]> output = inputDataFiles.map(
        new Function<Tuple2<String,String>, Object[]>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String,String> v1) throws Exception {
                System.out.println("in map!");
                // do something with v1.
                return new Object[0]; // placeholder result
            }
        });

    output.foreach(new VoidFunction<Object[]>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(Object[] t) throws Exception {
            // do nothing!
            System.out.println("in foreach!");
        }
    });

This code works perfectly fine in a standalone setup on my local laptop,
accessing both local files and remote HDFS files.

On the cluster, the same code produces no results. My intuition is that the
data has not reached the individual executors, and hence neither the `map`
nor the `foreach` runs. That is only a guess, but I cannot figure out why
this would not work on the cluster. I don't even see the print statements
in `map` and `foreach` getting printed in cluster mode.

I notice a particular line in the standalone output that I do NOT see in
the cluster execution.

    16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
    Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345

I had similar code using textFile() that worked earlier for individual
files on the cluster. The issue is with wholeTextFiles() only.
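
For context, my understanding of the difference: textFile() yields one
record per line, while wholeTextFiles() yields one (path, contents) pair
per file and combines many small files into a few partitions, so far fewer
tasks get launched (the single input split above lists all 10 files). A
minimal sketch of the contrast, reusing the folder above:

    // textFile: one String record per line, across all files in the folder
    JavaRDD<String> lines =
        sparkContext.textFile("hdfs://ip:8020/user/cdhuser/inputFolder/");

    // wholeTextFiles: one (path, fileContents) pair per file; many small
    // files are combined into a few partitions, so only a few tasks run
    JavaPairRDD<String,String> files =
        sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
    System.out.println("partitions: " + files.partitions().size());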

Please advise on the best way to get this working, or on alternative
approaches.

My setup is the Cloudera 5.7 distribution with the Spark service. I used
`yarn-client` as the master.

The action can be anything; it is just a dummy step to invoke the map. I also
tried `System.out.println("Count is:" + output.count());`, for which I got
the correct answer of `10`, since there were 10 files in the folder, but
the map still refuses to work.
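
One way to verify the map without relying on println, which in yarn-client
mode runs on the executors and so writes to the executor stdout logs rather
than the driver console, is to pull a small, bounded sample back to the
driver. A sketch, assuming the results are small enough to collect:

    // collect() brings every record back to the driver: fine for 10 files,
    // dangerous for large RDDs. take(n) is a bounded alternative.
    java.util.List<Object[]> results = output.collect();
    System.out.println("Map produced " + results.size() + " results");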

Thanks.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
I looked at the driver logs, and that reminded me that I needed to look at
the executor logs. There, the issue was that the Spark executors were not
getting a configuration file. I broadcast the file and now the processing
happens. Thanks for the suggestion.
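
For anyone hitting the same problem, here is a minimal sketch of the
broadcast approach, assuming the configuration fits in a
java.util.Properties loaded on the driver (the file name and key below are
illustrative):

    // needs: import org.apache.spark.broadcast.Broadcast;

    // Driver side: load the config once and broadcast it to all executors.
    java.util.Properties props = new java.util.Properties();
    props.load(new java.io.FileInputStream("app.conf")); // illustrative path
    final Broadcast<java.util.Properties> confBc = sparkContext.broadcast(props);

    // Executor side: read the broadcast value inside the function.
    JavaRDD<Object[]> processed = inputDataFiles.map(
        new Function<Tuple2<String,String>, Object[]>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String,String> v1) throws Exception {
                String setting = confBc.value().getProperty("some.key"); // illustrative key
                // process the file contents in v1._2() using the setting...
                return new Object[] { v1._1(), setting };
            }
        });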
Currently my issue is that the log file generated independently by each
executor goes to the respective container's appcache, and then it gets
lost. Is there a recommended way to retrieve the output files from the
individual executors?
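
One pattern that avoids the appcache entirely, sketched under the
assumption that the per-file output can be expressed as text records:
return the output as RDD records and let Spark write them to HDFS, instead
of having each executor write its own local file (output path is
illustrative):

    // Each input file becomes one output record; saveAsTextFile then writes
    // one part-file per partition to HDFS, where it survives the job.
    JavaRDD<String> report = inputDataFiles.map(
        new Function<Tuple2<String,String>, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String call(Tuple2<String,String> v1) throws Exception {
                return v1._1() + "\t" + v1._2().length(); // path and size, as an example
            }
        });
    report.saveAsTextFile("hdfs://ip:8020/user/cdhuser/outputFolder/");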

On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Are you looking at the worker logs or the driver?


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Bangalore.


Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-21 Thread Nisha Menon
Well, I have already tried that.
You are talking about a command similar to this, right? `yarn logs
-applicationId application_Number`
This gives me the processing logs, which contain information about the
tasks, RDD blocks, etc.

What I really need is the output that the Spark job itself generates: the
job writes some output to a file whose path is specified in the job. That
file currently ends up in the container's appcache; is there a way to
retrieve it once the job is over?
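
If the file really has to be written from inside the executors, one option
(a sketch, assuming the executors can reach the NameNode; the URI and
directory are illustrative) is to write it straight to HDFS with the Hadoop
FileSystem API instead of to a container-local path:

    // Executor side, e.g. inside the map or a foreachPartition: write to
    // HDFS so the file outlives the container's appcache.
    org.apache.hadoop.conf.Configuration hconf = new org.apache.hadoop.conf.Configuration();
    org.apache.hadoop.fs.FileSystem fs = org.apache.hadoop.fs.FileSystem.get(
        java.net.URI.create("hdfs://ip:8020"), hconf);
    org.apache.hadoop.fs.Path out = new org.apache.hadoop.fs.Path(
        "/user/cdhuser/joblogs/part-" + org.apache.spark.TaskContext.get().partitionId());
    try (java.io.OutputStream os = fs.create(out)) {
        os.write("executor-side output\n".getBytes("UTF-8"));
    }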



On Wed, Sep 21, 2016 at 4:00 PM, ayan guha <guha.a...@gmail.com> wrote:

> On YARN, logs are aggregated from each container to HDFS. You can use the
> YARN CLI or UI to view them. For Spark, you would have a history server
> which consolidates the logs.


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Bangalore.