On YARN, logs are aggregated from each container to HDFS. You can use the YARN CLI or UI to view them. For Spark, you would have a history server which consolidates the logs.
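For example, assuming log aggregation is enabled and the application has finished, the aggregated container logs can typically be pulled with the yarn CLI (the application id below is a placeholder):

    yarn logs -applicationId <application_id>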
On 21 Sep 2016 19:03, "Nisha Menon" <nisha.meno...@gmail.com> wrote:

> I looked at the driver logs, and that reminded me that I needed to look at
> the executor logs. There the issue was that the Spark executors were not
> getting a configuration file. I broadcast the file and now the processing
> happens. Thanks for the suggestion.
> Currently my issue is that the log files generated independently by the
> executors go to the respective containers' appcache and are then lost. Is
> there a recommended way to get the output files from the individual
> executors?
>
> On Thu, Sep 8, 2016 at 12:32 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>
>> Are you looking at the worker logs or the driver?
>>
>> On Thursday, September 8, 2016, Nisha Menon <nisha.meno...@gmail.com>
>> wrote:
>>
>>> I have an RDD created as follows:
>>>
>>>     JavaPairRDD<String, String> inputDataFiles =
>>>         sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");
>>>
>>> On this RDD I perform a map to process individual files and invoke a
>>> foreach to trigger the same map.
>>>
>>>     JavaRDD<Object[]> output = inputDataFiles.map(
>>>         new Function<Tuple2<String, String>, Object[]>() {
>>>
>>>             private static final long serialVersionUID = 1L;
>>>
>>>             @Override
>>>             public Object[] call(Tuple2<String, String> v1) throws Exception {
>>>                 System.out.println("in map!");
>>>                 // do something with v1.
>>>                 return new Object[] { /* results for this file */ };
>>>             }
>>>         });
>>>
>>>     output.foreach(new VoidFunction<Object[]>() {
>>>
>>>         private static final long serialVersionUID = 1L;
>>>
>>>         @Override
>>>         public void call(Object[] t) throws Exception {
>>>             // do nothing!
>>>             System.out.println("in foreach!");
>>>         }
>>>     });
>>>
>>> This code works perfectly fine in standalone mode on my local laptop,
>>> accessing both local files and remote HDFS files.
>>>
>>> On the cluster the same code produces no results. My intuition is that
>>> the data has not reached the individual executors and hence neither the
>>> `map` nor the `foreach` works, but that is only a guess. I am not able to
>>> figure out why this would not work on the cluster. I don't even see the
>>> print statements in `map` and `foreach` getting printed in cluster mode.
>>>
>>> I notice a particular line in the standalone output that I do NOT see in
>>> the cluster execution:
>>>
>>>     16/09/07 17:35:35 INFO WholeTextFileRDD: Input split: Paths:/user/cdhuser/inputFolder/data1.txt:0+657345,/user/cdhuser/inputFolder/data10.txt:0+657345,/user/cdhuser/inputFolder/data2.txt:0+657345,/user/cdhuser/inputFolder/data3.txt:0+657345,/user/cdhuser/inputFolder/data4.txt:0+657345,/user/cdhuser/inputFolder/data5.txt:0+657345,/user/cdhuser/inputFolder/data6.txt:0+657345,/user/cdhuser/inputFolder/data7.txt:0+657345,/user/cdhuser/inputFolder/data8.txt:0+657345,/user/cdhuser/inputFolder/data9.txt:0+657345
>>>
>>> I had similar code with textFile() that worked earlier for individual
>>> files on the cluster. The issue is with wholeTextFiles() only.
>>>
>>> Please advise on the best way to get this working, or suggest alternative
>>> approaches.
>>>
>>> My setup is the Cloudera 5.7 distribution with the Spark service. I used
>>> `yarn-client` as the master.
>>>
>>> The action can be anything; it is just a dummy step to invoke the map. I
>>> also tried System.out.println("Count is:"+output.count());, for which I
>>> got the correct answer of `10`, since there were 10 files in the folder,
>>> but still the map refuses to work.
>>>
>>> Thanks.
>>>
>>
>> --
>> Thanks,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>>
>
> --
> Nisha Menon
> BTech (CS) Sahrdaya CET,
> MTech (CS) IIIT Bangalore.
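For reference, the "broadcast the file" step mentioned earlier in this thread could look roughly like the sketch below, reusing sparkContext and inputDataFiles from the code above. The local path and the names confText, confBroadcast, and conf are illustrative assumptions, not the exact code used here: the configuration file is read once on the driver and its contents are shipped to every executor as a broadcast variable.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;
    import scala.Tuple2;

    // Read the configuration file once on the driver (hypothetical local path).
    String confText = new String(java.nio.file.Files.readAllBytes(
            java.nio.file.Paths.get("/path/to/app.conf")));

    // Ship the contents to the executors as a broadcast variable.
    final Broadcast<String> confBroadcast = sparkContext.broadcast(confText);

    JavaRDD<Object[]> output = inputDataFiles.map(
        new Function<Tuple2<String, String>, Object[]>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Object[] call(Tuple2<String, String> v1) throws Exception {
                // The broadcast value is available locally on every executor.
                String conf = confBroadcast.value();
                // ... use conf to process the file contents in v1._2() ...
                return new Object[] { v1._1() };
            }
        });

An alternative with a similar effect is to call sparkContext.addFile(...) on the driver and read the file on the executors via SparkFiles.get(...), which distributes the file itself rather than its contents.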