It is possible to do what you are trying to do, but it only makes sense if your MR 
job is very CPU-intensive and you want to use the CPU resources of your 
cluster rather than its IO.
You may want to do some research about HDFS's role in Hadoop. Among other 
things, it provides central storage for all the files that will be 
processed by MR jobs. If you don't want to use HDFS, then you need to provide 
some shared storage that all the nodes in your cluster can reach. HDFS is NOT 
required, but shared storage of some kind is required in the cluster.
To keep your question simple, let's just use NFS to replace HDFS. That is 
good enough for a POC to help you understand how to set it up.
Assume you have a cluster with 3 nodes (one NN and two DNs; the JT runs on the NN, 
and a TT runs on each DN). Instead of using HDFS, you can try to use NFS this 
way:
1) Mount /share_data on both of your data nodes. They need to have the same 
mount, so /share_data on each data node points to the same NFS location. It 
doesn't matter where you host this NFS share, just make sure each data node 
mounts it at the same /share_data.

2) Create a folder under /share_data and put all your data into that folder.

3) When you kick off your MR job, give a full URL for the input path, like 
'file:///share_data/myfolder', and also a full URL for the output path, like 
'file:///share_data/output'. This way, each mapper understands that it reads 
the data from the local file system instead of HDFS. That is why you want every 
task node to have the same mount path: 'file:///share_data/myfolder' must work 
identically on each task node. Check and make sure that /share_data/myfolder 
points to the same data on every task node.

4) You want each mapper to process one file, so instead of using the default 
'TextInputFormat', use a 'WholeFileInputFormat'. This makes sure that no file 
under '/share_data/myfolder' is split, so each whole file goes to a single 
mapper.

5) With the above setup, I don't think you need to start the NameNode or 
DataNode processes any more; you only use the JobTracker and TaskTrackers.

6) Obviously, when your data is big, the NFS share will be your bottleneck. 
Later you could replace it with shared network storage, but the above setup 
gives you a starting point.

7) Keep in mind that with this setup you lose data replication, data locality, 
etc. That's why I said it ONLY makes sense if your MR job is CPU-intensive: you 
simply want to use the mapper/reducer tasks to process your data, without any 
of the IO scalability.
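The mount and job-submission steps above can be sketched as shell commands; 
the NFS server name, export path, jar name, and job class below are placeholders 
for your environment, not values from this thread:

```shell
# On EACH data node: mount the same NFS export at the same local path.
# 'nfs-server' and '/exports/share_data' are placeholders.
sudo mkdir -p /share_data
sudo mount -t nfs nfs-server:/exports/share_data /share_data

# From any node: put the input files into one folder on the share.
mkdir -p /share_data/myfolder
cp ~/inputs/* /share_data/myfolder/

# Submit the job with full file:// URLs so every task reads from the
# local mount instead of HDFS. 'myjob.jar' and 'MyJob' are placeholders.
hadoop jar myjob.jar MyJob \
    file:///share_data/myfolder \
    file:///share_data/output
```

Because every task node resolves file:///share_data/... to the same NFS-backed 
data, the tasks behave as if they shared one file system.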
Make sense?
Yong

Date: Fri, 23 Aug 2013 15:43:38 +0530
Subject: Re: running map tasks in remote node
From: [email protected]
To: [email protected]

Thanks for the reply. 
I am basically exploring possible ways to use the Hadoop framework for one of 
my use cases. I have my limitations in using HDFS, but I agree that 
using MapReduce in conjunction with HDFS makes sense.  

I successfully tested WholeFileInputFormat after some googling. 
Now, coming to my use case. I would like to keep some files on my master node 
and do some processing on the cloud nodes. The policy does not allow us 
to configure and use the cloud nodes as HDFS. However, I would like to spawn map 
processes on those nodes. Hence, I set the input path to the local file system, for 
example, $HOME/inputs. There I have a file listing filenames (10 lines) in this input 
directory. I use NLineInputFormat and spawn 10 map processes; each map process 
gets one line. The map process then transfers the corresponding file and processes it.  
However, I get a FileNotFoundException for $HOME/inputs in the map. I am sure 
this directory is present on my master but not on the slave nodes. When I copy 
this input directory to the slave nodes, it works fine. I am not able to figure 
out how to fix this, or the reason for the error. I do not understand why it 
complains that the input directory is not present. As far as I know, a slave 
node gets a map task whose map method receives the contents of the 
input file, and that should be enough for the map logic to work.


with regards
rabmdu



On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <[email protected]> wrote:




If you don't plan to use HDFS, what kind of shared file system are you going 
to use between the cluster nodes? NFS? For what you want to do, even though it 
doesn't make too much sense, you first need to solve this problem of the shared 
file system.

Second, if you want to process the files file by file, instead of block by 
block as in HDFS, then you need to use a WholeFileInputFormat (google how 
to write one). Then you don't need a file listing all the files to be processed; 
just put them into one folder in the shared file system and pass this folder 
to your MR job. In this way, as long as each node can access it through the same 
file system URL, each file will be processed by its own mapper.
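For reference, a WholeFileInputFormat along these lines is commonly written as 
below. This is a sketch, not code from this thread: it assumes the newer 
org.apache.hadoop.mapreduce API and needs the Hadoop client libraries on the 
classpath; it reads each input file as one record, so each file goes to exactly 
one mapper.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One record per file: the key is unused, the value is the file's bytes.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split a file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;  // only one record per split
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* stream already closed in nextKeyValue */ }
    }
}
```

Set it on the job with job.setInputFormatClass(WholeFileInputFormat.class); 
because isSplitable returns false, one mapper gets one whole file regardless 
of file size.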

Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: [email protected]
To: [email protected]


Hello, 

Here is the newbie question of the day. For one of my use cases, I want to use 
Hadoop MapReduce without HDFS. Here, I will have a text file containing a list 
of file names to process. Assume that I have 10 lines (10 files to process) in 
the input text file, and I wish to generate 10 map tasks and execute them in 
parallel on 10 nodes. I started with the basic tutorial on Hadoop, set up a 
single-node Hadoop cluster, and successfully tested the wordcount code.

Now, I took two machines, A (master) and B (slave), and did the below 
configuration on these machines to set up a two-node cluster.

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>
</configuration>

On both A and B, I have a file named ‘slaves’ with an entry ‘B’ in it, and 
another file called ‘masters’ containing an entry ‘A’.

I have kept my input file at A. I see the map method process the input file 
line by line, but all of the processing happens on A. Ideally, I would expect 
that processing to take place on B.

 Can anyone highlight where I am going wrong?

regards
rab
