Figured out issue 1: the reduce output was being written on the slave node, while I was looking for it on the master node, which is perfectly fine. I still need guidance on issue 2, though!

Thanks
Atish

On Wed, Jan 8, 2014 at 3:30 PM, Atish Kathpal <[email protected]> wrote:

> Hi
>
> By giving the complete URI, the MR jobs worked across both nodes. Thanks a
> lot for the advice.
>
> *Two issues though*:
> 1. On completion of the MR job, I see only the "_SUCCESS" file in the
> output directory, but no part-r file containing the actual results of the
> wordcount job. However, I am seeing the correct output on running MR over
> HDFS. What is going wrong? Is there any place I can find logs for the MR
> job? I see no errors on the console.
> Command used:
> hadoop jar
> /home/hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
> wordcount file:///home/hduser/testmount/ file:///home/hduser/testresults/
>
>
> 2. I am observing that the mappers seem to be accessing files
> sequentially, splitting the files across mappers, and then reading data in
> parallel, then moving on to the next file. What I want instead is that
> files themselves should be accessed in parallel, that is, if there are 10
> files to be MRed, then MR should ask for each of these files in parallel in
> one go, and then work on the splits of these files in parallel.
> *Why do I need this?* Some of the data coming from the NFS mount point is
> coming from offline media (which takes ~5-10 seconds before the first
> bytes are received). So I would like all required files to be asked for at
> the onset itself from the NFS mount point. This way several offline media
> will be spun up in parallel, and as the data from these media becomes
> available MR can process it.
>
> Would be glad to get inputs on these points!
>
> Thanks
> Atish
>
> Tip for those who are trying similar stuff:
> In my case, after a while the jobs would fail, complaining of
> "java.lang.OutOfMemoryError: Java heap space",
> but I was able to rectify this with help from:
> http://stackoverflow.com/questions/13674190/cdh-4-1-error-running-child-java-lang-outofmemoryerror-java-heap-space
>
>
> On Sun, Dec 22, 2013 at 2:47 PM, Atish Kathpal <[email protected]> wrote:
>
>> Thanks Devin, Yong, and Chris for your replies and suggestions. I will
>> test the suggestions made by Yong and Devin and get back to you guys.
>>
>> As for the bottlenecking issue, I agree, but I am trying to run a few MR
>> jobs on a traditional NAS server. I can live with a few bottlenecks, so
>> long as I don't have to move the data to a dedicated HDFS cluster.
>>
>>
>> On Sat, Dec 21, 2013 at 8:06 AM, Chris Mawata <[email protected]> wrote:
>>
>>> Yong raises an important issue: You have thrown out the I/O
>>> advantages of HDFS and also thrown out the advantages of data locality. It
>>> would be interesting to know why you are taking this approach.
>>> Chris
>>>
>>>
>>> On 12/20/2013 9:28 AM, java8964 wrote:
>>>
>>> I believe the "-fs local" should be removed too. The reason is that even
>>> if you have a dedicated JobTracker after removing "-jt local", with "-fs
>>> local" I believe that all the mappers will be run sequentially.
>>>
>>> "-fs local" will force MapReduce to run in "local" mode, which is
>>> really a test mode.
>>>
>>> What you can do is to remove both "-fs local -jt local", but give the
>>> FULL URI of the input and output path, to tell Hadoop that they are on
>>> the local filesystem instead of HDFS.
>>>
>>> "hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount file:///hduser/mount_point file:///results"
>>>
>>> Keep in mind the following:
>>>
>>> 1) The NFS mount needs to be available on all your Task Nodes, and
>>> mounted in the same way.
>>> 2) Even if you can do that, your shared storage will be your
>>> bottleneck. NFS won't work well for scalability.
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Date: Fri, 20 Dec 2013 09:01:32 -0500
>>> Subject: Re: Running Hadoop v2 clustered mode MR on an NFS mounted
>>> filesystem
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> I think most of your problem is coming from the options you are setting:
>>>
>>> "hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount *-fs local -jt local* /hduser/mount_point/ /results"
>>>
>>> You appear to be directing your namenode to run jobs in the *LOCAL* job
>>> runner and directing it to read from the *LOCAL* filesystem. Drop the
>>> *-jt* argument and it should run in distributed mode if your cluster is
>>> set up right. You don't need to do anything special to point Hadoop towards
>>> an NFS location, other than set up the NFS location properly and make sure,
>>> if you are directing to it by name, that it will resolve to the right
>>> address. Hadoop doesn't care where it is, as long as it can read from and
>>> write to it. The fact that you are telling it to read/write from/to an NFS
>>> location that happens to be mounted as a local filesystem object doesn't
>>> matter - you could direct it to the local /hduser/ path and set the -fs
>>> local option, and it would end up on the NFS mount, because that's where
>>> the NFS mount actually exists, or you could direct it to the absolute
>>> network location of the folder that you want; it shouldn't make a
>>> difference.
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Fri, Dec 20, 2013 at 5:27 AM, Atish Kathpal
>>> <[email protected]> wrote:
>>>
>>> Hello
>>>
>>> The picture below describes the deployment architecture I am trying to
>>> achieve.
>>> However, when I run the wordcount example code with the below
>>> configuration, by issuing the command from the master node, I notice only
>>> the master node spawning map tasks and completing the submitted job. Below
>>> is the command I used:
>>>
>>> *hadoop jar
>>> /hduser/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
>>> wordcount -fs local -jt local /hduser/mount_point/ /results*
>>>
>>> *Question: How can I leverage both the hadoop nodes for running MR,
>>> while serving my data from the common NFS mount point running my filesystem
>>> at the backend? Has anyone tried such a setup before?*
>>> [image: Inline image 1]
>>>
>>> Thanks!
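
On issue 2 in the thread (asking for all input files at the onset so that several offline media spin up at the same time), one possible workaround outside of MapReduce is to read the first byte of every input file concurrently before submitting the job. The sketch below is only an illustration of that idea, not a Hadoop feature or setting. The function names are hypothetical, and it assumes that a one-byte read through the NFS mount is enough to make the storage backend start bringing the file online, which would need to be verified against the actual storage.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def touch_first_byte(path):
    """Read one byte so the storage backend has to start serving the file."""
    with open(path, "rb") as handle:
        handle.read(1)
    return path


def list_files(root):
    """Collect every regular file under root (hypothetical helper)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


def warm_up(root, max_workers=16):
    """Request the first byte of every file under root concurrently."""
    paths = list_files(root)
    if not paths:
        return []
    workers = min(max_workers, len(paths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits all reads up front; results come back in input order.
        return list(pool.map(touch_first_byte, paths))
```

With the paths from the thread, this would be called as warm_up("/home/hduser/testmount") on a node that has the NFS mount, right before running the hadoop jar command. It does not change how the mappers themselves walk through the files; it only tries to get the 5-10 second first-byte delays to overlap instead of adding up.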
