In multi-node mode, MR requires a distributed (or at least shared) filesystem, such as HDFS, to be able to run.
On Sun, Aug 25, 2013 at 7:59 PM, rab ra <[email protected]> wrote:
> Dear Yong,
>
> Thanks for your elaborate answer. Your answer really makes sense, and I am
> ending up with something close to it, except for the shared storage.
>
> In my use case, I am not allowed to use any shared storage system. The
> reason being that the slave nodes may not be safe for hosting sensitive
> data (because they could belong to a different enterprise, perhaps in the
> cloud). I do agree that we still need this data on the slave node while
> doing processing, and hence need to transfer the data from the enterprise
> node to the processing nodes. But that is acceptable, as it is better than
> using the slave nodes for storage. If I could use shared storage, then I
> could use HDFS itself. I wrote simple example code with a 2-node cluster
> setup and was testing various input formats such as WholeFileInputFormat,
> NLineInputFormat, and TextInputFormat. I faced issues when I did not want
> to use shared storage, as I explained in my last email. I was thinking
> that having the input file on the master node (job tracker) would be
> sufficient, and that it would send a portion of the input file to the map
> process on the second node (slave). But this was not the case, as the
> method setInputPath() (and the map reduce system) expects this path to be
> a shared one. All these observations lead to the straightforward question:
> "Does the map reduce system expect a shared storage system? Do the input
> directories need to be present in that shared system? Is there a
> workaround for this issue?" In fact, I am prepared to use HDFS just to
> satisfy the map reduce system and feed input to it; for the actual
> processing, I shall end up transferring the required data files to the
> slave nodes.
>
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data-location awareness, etc.
>
> with thanks and regards
> rabmdu
>
>
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 <[email protected]> wrote:
>>
>> It is possible to do what you are trying to do, but it only makes sense
>> if your MR job is very CPU intensive and you want to use the CPU
>> resources in your cluster, rather than the IO.
>>
>> You may want to do some research about HDFS's role in Hadoop. First of
>> all, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, you need to
>> identify a shared storage to be shared among all the nodes in your
>> cluster. HDFS is NOT required, but a shared storage IS required in the
>> cluster.
>>
>> To simplify your question, let's just use NFS to replace HDFS. A
>> proof-of-concept is possible, to help you understand how to set it up.
>>
>> Assume you have a cluster with 3 nodes (one NN and two DNs; the JT
>> running on the NN, and a TT running on each DN). Instead of using HDFS,
>> you can try to use NFS this way:
>>
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount, so that /share_data on each data node points to the same NFS
>> location. It doesn't matter where you host this NFS share, but make sure
>> each data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data and put all your data into that
>> folder.
>> 3) When kicking off your MR job, you need to give a full URL for the
>> input path, like 'file:///share_data/myfolder', and also a full URL for
>> the output path, like 'file:///share_data/output'. This way, each mapper
>> will understand that it is in fact reading data from the local file
>> system, instead of HDFS. That's the reason you want to make sure each
>> task node has the same mount path, so that 'file:///share_data/myfolder'
>> works on every task node. Check this and make sure that
>> /share_data/myfolder points to the same path on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat'. This will make
>> sure that each file under '/share_data/myfolder' is not split, and is
>> sent whole to a single mapper.
>> 5) In the above setup, I don't think you need to start the NameNode or
>> DataNode processes any more; you only use the JobTracker and
>> TaskTrackers.
>> 6) Obviously, when your data is big, the NFS share will be your
>> bottleneck. Maybe you can replace it with shared network storage, but
>> the above setup gives you a starting point.
>> 7) Keep in mind that with a setup like the above, you lose data
>> replication, data locality, etc. That's why I said it ONLY makes sense
>> if your MR job is CPU intensive: you simply want to use the
>> mapper/reducer tasks to process your data, without any IO scalability.
>>
>> Make sense?
>>
>> Yong
>>
>> ________________________________
>> Date: Fri, 23 Aug 2013 15:43:38 +0530
>> Subject: Re: running map tasks in remote node
>> From: [email protected]
>> To: [email protected]
>>
>> Thanks for the reply.
>>
>> I am basically exploring possible ways to work with the hadoop framework
>> for one of my use cases. I have limitations on using HDFS, but I agree
>> with the fact that using map reduce in conjunction with HDFS makes
>> sense.
>>
>> I successfully tested WholeFileInputFormat after some googling.
>>
>> Now, coming to my use case. I would like to keep some files on my master
>> node and do some processing on the cloud nodes. The policy does not
>> allow us to configure and use the cloud nodes as HDFS. However, I would
>> like to spawn a map process on those nodes. Hence, I set the input path
>> to the local file system, for example $HOME/inputs. I have a file
>> listing filenames (10 lines) in this input directory. I use
>> NLineInputFormat and spawn 10 map processes. Each map process gets a
>> line. The map process will then do a file transfer and process it.
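The setup just described (a local listing file, NLineInputFormat, one line per map task that fetches and processes the named file) can be sketched as a driver like the one below. This is a hedged sketch against the Hadoop 1.x "new" mapreduce API: the class names `FileListDriver` and `FetchMapper`, and the `file:///home/user/...` paths, are illustrative assumptions, not code from the thread.

```java
// Sketch (assumed names/paths): NLineInputFormat driver where each map task
// receives one line of a listing file naming a file to fetch and process.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

  // Each call to map() receives one line of the listing file: the name
  // of a file to transfer to this node and process locally.
  public static class FetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String fileName = line.toString().trim();
      // ... transfer fileName from the enterprise node, process it ...
      context.write(new Text(fileName), new Text("done"));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "file-list");
    job.setJarByClass(FileListDriver.class);
    job.setInputFormatClass(NLineInputFormat.class);
    // One line of the listing file per map task.
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Full file: URLs. Note: these paths must resolve on EVERY task node,
    // which is exactly why a shared mount (or HDFS) is needed here.
    NLineInputFormat.addInputPath(job, new Path("file:///home/user/inputs"));
    FileOutputFormat.setOutputPath(job, new Path("file:///home/user/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The `file:///home/user/inputs` path makes the failure mode concrete: the tasks resolve it against the local filesystem of whichever node they run on, which is why the listing file must exist at that same path on every node.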
>> However, I get an error in the map saying FileNotFoundException:
>> $HOME/inputs. I am sure this directory is present on my master but not
>> on the slave nodes. When I copy this input directory to the slave nodes,
>> it works fine. I am not able to figure out how to fix this, or the
>> reason for the error. I do not understand why it complains that the
>> input directory is not present. As far as I know, the slave nodes get a
>> map task, and the map method receives the contents of the input file.
>> This should be enough for the map logic to work.
>>
>>
>> with regards
>> rabmdu
>>
>>
>> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <[email protected]> wrote:
>>
>> If you don't plan to use HDFS, what kind of shared file system are you
>> going to use between the cluster nodes? NFS?
>> For what you want to do, even though it doesn't make too much sense, you
>> first need to solve the problem of the shared file system.
>>
>> Second, if you want to process the data file by file, instead of block
>> by block as in HDFS, then you need to use a WholeFileInputFormat (google
>> how to write one). Then you don't need a file listing all the files to
>> be processed; just put them into one folder on the shared file system
>> and pass this folder to your MR job. This way, as long as each node can
>> access the folder through some file system URL, each file will be
>> processed in its own mapper.
>>
>> Yong
>>
>> ________________________________
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: [email protected]
>> To: [email protected]
>>
>>
>> Hello,
>>
>> Here is the newbie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to
>> process. Assume that I have 10 lines (10 files to process) in the input
>> text file, and I wish to generate 10 map tasks and execute them in
>> parallel on 10 nodes.
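The WholeFileInputFormat that Yong suggests above is not shipped with Hadoop and has to be written by hand. A minimal sketch under the new mapreduce API follows the usual pattern: disable splitting, and supply a record reader that emits the whole file as a single record. The choice of `NullWritable` keys and `BytesWritable` values is an assumption of this sketch (it presumes files small enough to buffer in memory), not something specified in the thread.

```java
// Sketch (assumed key/value types): an input format that hands each
// file, unsplit, to a single mapper as one (NullWritable, BytesWritable)
// record. Suitable only for files that fit in memory.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split: one file maps to exactly one mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {
  private FileSplit split;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false; // the single record has already been emitted
    }
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
  @Override public void close() { }
}
```

A driver would then set `job.setInputFormatClass(WholeFileInputFormat.class)` and point the input path at the shared folder, e.g. `file:///share_data/myfolder`, as described in the steps above.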
>> I started with the basic tutorial on hadoop, could set up a single-node
>> hadoop cluster, and successfully tested the wordcount code.
>>
>> Now, I took two machines, A (master) and B (slave), and did the
>> configuration below on these machines to set up a two-node cluster.
>>
>> hdfs-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://A:9000</value>
>>   </property>
>> </configuration>
>>
>> On A and B, I have a file named 'slaves' with an entry 'B' in it, and
>> another file called 'masters' with an entry 'A' in it.
>>
>> I have kept my input file on A. I see the map method process the input
>> file line by line, but the lines are all processed on A. Ideally, I
>> would expect that processing to take place on B.
>>
>> Can anyone highlight where I am going wrong?
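Two details in the configuration above are worth flagging. First, the mapred.job.tracker property also appears in hdfs-site.xml, where it is harmless but misplaced; it belongs in mapred-site.xml only. Second, with fs.default.name set to hdfs://A:9000, any input path given without a scheme is resolved against HDFS. If the goal is to run MapReduce over the local (or NFS-mounted) file system instead, one option is to make the local file system the default, so that no NameNode or DataNode is needed at all. A sketch of such a core-site.xml:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml (sketch): use the local/NFS-mounted file system as
     the default file system instead of HDFS. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```

Alternatively, fs.default.name can be left pointing at HDFS and individual jobs can pass explicit file:/// URLs for their input and output paths, as Yong describes earlier in the thread.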
>>
>> regards
>> rab

-- 
Harsh J
