I looked at the driver logs, and that reminded me that I also needed to look
at the executor logs. The issue there was that the Spark executors were not
receiving a configuration file. I broadcast the file and the processing now
runs. Thanks for the suggestion.
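
For reference, a minimal sketch of the kind of change I made (broadcasting the
file's contents from the driver); the local path and the variable names below
are placeholders, not the real ones:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.spark.broadcast.Broadcast;

    // Read the configuration file once on the driver.
    String configText = new String(
        Files.readAllBytes(Paths.get("/local/path/app.conf")), // placeholder path
        StandardCharsets.UTF_8);

    // Broadcast it so every executor gets its own read-only copy.
    final Broadcast<String> configBroadcast = sparkContext.broadcast(configText);

    // Inside map()/foreach() on the executors, read it via configBroadcast.value().
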
My current issue is that the log files generated independently by each
executor go to the respective container's appcache and are then lost. Is there
a recommended way to collect these output files from the individual executors?
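
Would something along the lines of the sketch below be reasonable, i.e. having
each task write its file straight to HDFS instead of the local appcache? The
output path, the per-partition file naming and the logText variable are only
placeholders:

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.TaskContext;

    // Inside the map()/foreach() function, i.e. running on an executor:
    FileSystem fs = FileSystem.get(URI.create("hdfs://ip:8020"), new Configuration());

    // One file per partition, so concurrent tasks do not overwrite each other.
    Path target = new Path("/user/cdhuser/outputFolder/log-part-"
            + TaskContext.get().partitionId());

    String logText = "..."; // placeholder for whatever this task produced
    try (FSDataOutputStream out = fs.create(target)) {
        out.write(logText.getBytes(StandardCharsets.UTF_8));
    }

Or is there a more standard pattern for this?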

On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Are you looking at the worker logs or the driver?
>
>
> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
> wrote:
>
>> I have an RDD created as follows:
>>
>>     JavaPairRDD<String, String> inputDataFiles =
>>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>>
>> On this RDD I perform a map to process the individual files, and then invoke
>> a foreach action to trigger that map.
>>
>>     JavaRDD<Object[]> output = inputDataFiles.map(
>>         new Function<Tuple2<String, String>, Object[]>() {
>>
>>             private static final long serialVersionUID = 1L;
>>
>>             @Override
>>             public Object[] call(Tuple2<String, String> v1) throws Exception {
>>                 System.out.println("in map!");
>>                 // do something with v1: v1._1() is the file path, v1._2() its contents.
>>                 return new Object[0]; // placeholder result
>>             }
>>         });
>>
>>     output.foreach(new VoidFunction<Object[]>() {
>>
>>         private static final long serialVersionUID = 1L;
>>
>>         @Override
>>         public void call(Object[] t) throws Exception {
>>             // do nothing!
>>             System.out.println("in foreach!");
>>         }
>>     });
>>
>> This code works perfectly fine in a standalone setup on my local laptop,
>> accessing both local files and remote HDFS files.
>>
>> On the cluster, the same code produces no results. My intuition is that the
>> data has not reached the individual executors, and hence neither the `map`
>> nor the `foreach` runs, but this is only a guess; I am not able to figure out
>> why it would fail on the cluster. I don't even see the print statements in
>> `map` and `foreach` getting printed when running on the cluster.
>>
>> I notice a particular line in the standalone output that I do NOT see in the
>> cluster execution:
>>
>>     *16/09/07 17:35:35 INFO WholeTextFileRDD: Input split:
>> Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345*
>>
>> I had similar code using textFile() that worked earlier for individual files
>> on the cluster. The issue is only with wholeTextFiles().
>>
>> Please advise on the best way to get this working, or on alternative
>> approaches.
>>
>> My setup is the Cloudera 5.7 distribution with the Spark service. I set the
>> master to `yarn-client`.
>>
>> The action can be anything; it's just a dummy step to invoke the map. I also
>> tried `System.out.println("Count is:"+output.count());`, which gave the
>> correct answer of `10`, since there were 10 files in the folder, but the map
>> still refuses to work.
>>
>> Thanks.
>>
>>
>
> --
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>


-- 
Nisha Menon
BTech (CS) Sahrdaya CET,
MTech (CS) IIIT Bangalore.
