In multi-node mode, MR requires a distributed (or at least shared) filesystem, such as HDFS, to be able to run.
On Sun, Aug 25, 2013 at 7:59 PM, rab ra <[email protected]> wrote:
> Dear Yong,
>
> Thanks for your elaborate answer. Your answer really makes sense, and I am
> ending up with something close to it, except for the shared storage.
>
> In my use case, I am not allowed to use any shared storage system. The
> reason being that the slave nodes may not be safe for hosting sensitive
> data (because they could belong to a different enterprise, perhaps in the
> cloud). I do agree that we still need this data on the slave node while
> doing processing, and hence need to transfer the data from the enterprise
> node to the processing nodes. But that is acceptable, as it is better than
> using the slave nodes for storage. If I could use shared storage, then I
> could use HDFS itself. I wrote simple example code with a 2-node cluster
> setup and was testing various input formats such as WholeFileInputFormat,
> NLineInputFormat, and TextInputFormat. I faced issues when I did not want
> to use shared storage, as I explained in my last email. I was thinking
> that having the input file on the master node (job tracker) would be
> sufficient, and that it would send a portion of the input file to the map
> process on the second node (slave). But this was not the case, as the
> method setInputPath() (and the map reduce system) expects this path to be
> a shared one. All these observations lead to the straightforward question:
> "Does the map reduce system expect a shared storage system? Do the input
> directories need to be present in that shared system? Is there a
> workaround for this issue?" In fact, I am prepared to use HDFS just to
> satisfy the map reduce system and feed input to it; for the actual
> processing, I shall end up transferring the required data files to the
> slave nodes.
>
> I do note that I cannot enjoy the advantages that come with HDFS, such as
> data replication, data-location awareness, etc.
>
> with thanks and regards
> rabmdu
>
>
> On Fri, Aug 23, 2013 at 7:41 PM, java8964 java8964 <[email protected]> wrote:
>>
>> It is possible to do what you are trying to do, but it only makes sense
>> if your MR job is very CPU intensive and you want to use the CPU
>> resources in your cluster, rather than the IO.
>>
>> You may want to do some research about HDFS's role in Hadoop. First of
>> all, it provides central storage for all the files that will be
>> processed by MR jobs. If you don't want to use HDFS, you need to
>> identify a shared storage to be shared among all the nodes in your
>> cluster. HDFS is NOT required, but a shared storage IS required in the
>> cluster.
>>
>> To simplify your question, let's just use NFS to replace HDFS. A
>> proof-of-concept is possible, to help you understand how to set it up.
>>
>> Assume you have a cluster with 3 nodes (one NN and two DNs; the JT
>> running on the NN, and a TT running on each DN). Instead of using HDFS,
>> you can try to use NFS this way:
>>
>> 1) Mount /share_data on both of your data nodes. They need to have the
>> same mount, so that /share_data on each data node points to the same NFS
>> location. It doesn't matter where you host this NFS share, but make sure
>> each data node mounts it as the same /share_data.
>> 2) Create a folder under /share_data and put all your data into that
>> folder.
>> 3) When kicking off your MR job, you need to give a full URL for the
>> input path, like 'file:///share_data/myfolder', and also a full URL for
>> the output path, like 'file:///share_data/output'. This way, each mapper
>> will understand that it is in fact reading data from the local file
>> system, instead of HDFS. That's the reason you want to make sure each
>> task node has the same mount path, so that 'file:///share_data/myfolder'
>> works on every task node. Check this and make sure that
>> /share_data/myfolder points to the same path on each of your task nodes.
>> 4) You want each mapper to process one file, so instead of using the
>> default 'TextInputFormat', use a 'WholeFileInputFormat'. This will make
>> sure that each file under '/share_data/myfolder' is not split, and is
>> sent whole to a single mapper.
>> 5) In the above setup, I don't think you need to start the NameNode or
>> DataNode processes any more; you only use the JobTracker and
>> TaskTrackers.
>> 6) Obviously, when your data is big, the NFS share will be your
>> bottleneck. Maybe you can replace it with shared network storage, but
>> the above setup gives you a starting point.
>> 7) Keep in mind that with a setup like the above, you lose data
>> replication, data locality, etc. That's why I said it ONLY makes sense
>> if your MR job is CPU intensive: you simply want to use the
>> mapper/reducer tasks to process your data, without any IO scalability.
>>
>> Make sense?
>>
>> Yong
>>
>> ________________________________
>> Date: Fri, 23 Aug 2013 15:43:38 +0530
>> Subject: Re: running map tasks in remote node
>> From: [email protected]
>> To: [email protected]
>>
>> Thanks for the reply.
>>
>> I am basically exploring possible ways to work with the hadoop framework
>> for one of my use cases. I have limitations on using HDFS, but I agree
>> with the fact that using map reduce in conjunction with HDFS makes
>> sense.
>>
>> I successfully tested WholeFileInputFormat after some googling.
>>
>> Now, coming to my use case. I would like to keep some files on my master
>> node and do some processing on the cloud nodes. The policy does not
>> allow us to configure and use the cloud nodes as HDFS. However, I would
>> like to spawn a map process on those nodes. Hence, I set the input path
>> to the local file system, for example $HOME/inputs. I have a file
>> listing filenames (10 lines) in this input directory. I use
>> NLineInputFormat and spawn 10 map processes. Each map process gets a
>> line. The map process will then do a file transfer and process it.
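The setup just described (a local listing file, NLineInputFormat, one line per map task that fetches and processes the named file) can be sketched as a driver like the one below. This is a hedged sketch against the Hadoop 1.x "new" mapreduce API: the class names `FileListDriver` and `FetchMapper`, and the `file:///home/user/...` paths, are illustrative assumptions, not code from the thread.

```java
// Sketch (assumed names/paths): NLineInputFormat driver where each map task
// receives one line of a listing file naming a file to fetch and process.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileListDriver {

  // Each call to map() receives one line of the listing file: the name
  // of a file to transfer to this node and process locally.
  public static class FetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String fileName = line.toString().trim();
      // ... transfer fileName from the enterprise node, process it ...
      context.write(new Text(fileName), new Text("done"));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "file-list");
    job.setJarByClass(FileListDriver.class);
    job.setInputFormatClass(NLineInputFormat.class);
    // One line of the listing file per map task.
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Full file: URLs. Note: these paths must resolve on EVERY task node,
    // which is exactly why a shared mount (or HDFS) is needed here.
    NLineInputFormat.addInputPath(job, new Path("file:///home/user/inputs"));
    FileOutputFormat.setOutputPath(job, new Path("file:///home/user/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The `file:///home/user/inputs` path makes the failure mode concrete: the tasks resolve it against the local filesystem of whichever node they run on, which is why the listing file must exist at that same path on every node.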
>> However, I get an error in the map saying FileNotFoundException:
>> $HOME/inputs. I am sure this directory is present on my master but not
>> on the slave nodes. When I copy this input directory to the slave nodes,
>> it works fine. I am not able to figure out how to fix this, or the
>> reason for the error. I do not understand why it complains that the
>> input directory is not present. As far as I know, the slave nodes get a
>> map task, and the map method receives the contents of the input file.
>> This should be enough for the map logic to work.
>>
>>
>> with regards
>> rabmdu
>>
>>
>> On Thu, Aug 22, 2013 at 4:40 PM, java8964 java8964 <[email protected]> wrote:
>>
>> If you don't plan to use HDFS, what kind of shared file system are you
>> going to use between the cluster nodes? NFS?
>> For what you want to do, even though it doesn't make too much sense, you
>> first need to solve the problem of the shared file system.
>>
>> Second, if you want to process the data file by file, instead of block
>> by block as in HDFS, then you need to use a WholeFileInputFormat (google
>> how to write one). Then you don't need a file listing all the files to
>> be processed; just put them into one folder on the shared file system
>> and pass this folder to your MR job. This way, as long as each node can
>> access the folder through some file system URL, each file will be
>> processed in its own mapper.
>>
>> Yong
>>
>> ________________________________
>> Date: Wed, 21 Aug 2013 17:39:10 +0530
>> Subject: running map tasks in remote node
>> From: [email protected]
>> To: [email protected]
>>
>>
>> Hello,
>>
>> Here is the newbie question of the day.
>>
>> For one of my use cases, I want to use hadoop map reduce without HDFS.
>> Here, I will have a text file containing a list of file names to
>> process. Assume that I have 10 lines (10 files to process) in the input
>> text file, and I wish to generate 10 map tasks and execute them in
>> parallel on 10 nodes.
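The WholeFileInputFormat that Yong suggests above is not shipped with Hadoop and has to be written by hand. A minimal sketch under the new mapreduce API follows the usual pattern: disable splitting, and supply a record reader that emits the whole file as a single record. The choice of `NullWritable` keys and `BytesWritable` values is an assumption of this sketch (it presumes files small enough to buffer in memory), not something specified in the thread.

```java
// Sketch (assumed key/value types): an input format that hands each
// file, unsplit, to a single mapper as one (NullWritable, BytesWritable)
// record. Suitable only for files that fit in memory.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split: one file maps to exactly one mapper
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {
  private FileSplit split;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false; // the single record has already been emitted
    }
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
  @Override public void close() { }
}
```

A driver would then set `job.setInputFormatClass(WholeFileInputFormat.class)` and point the input path at the shared folder, e.g. `file:///share_data/myfolder`, as described in the steps above.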
>> I started with the basic tutorial on hadoop, could set up a single-node
>> hadoop cluster, and successfully tested the wordcount code.
>>
>> Now, I took two machines, A (master) and B (slave), and did the
>> configuration below on these machines to set up a two-node cluster.
>>
>> hdfs-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/name</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/tmp/hadoop-bala/dfs/data</value>
>>   </property>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>> </configuration>
>>
>> mapred-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>A:9001</value>
>>   </property>
>>   <property>
>>     <name>mapreduce.tasktracker.map.tasks.maximum</name>
>>     <value>1</value>
>>   </property>
>> </configuration>
>>
>> core-site.xml
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://A:9000</value>
>>   </property>
>> </configuration>
>>
>> On A and B, I have a file named 'slaves' with an entry 'B' in it, and
>> another file called 'masters' with an entry 'A' in it.
>>
>> I have kept my input file on A. I see the map method process the input
>> file line by line, but the lines are all processed on A. Ideally, I
>> would expect that processing to take place on B.
>>
>> Can anyone highlight where I am going wrong?
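Two details in the configuration above are worth flagging. First, the mapred.job.tracker property also appears in hdfs-site.xml, where it is harmless but misplaced; it belongs in mapred-site.xml only. Second, with fs.default.name set to hdfs://A:9000, any input path given without a scheme is resolved against HDFS. If the goal is to run MapReduce over the local (or NFS-mounted) file system instead, one option is to make the local file system the default, so that no NameNode or DataNode is needed at all. A sketch of such a core-site.xml:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml (sketch): use the local/NFS-mounted file system as
     the default file system instead of HDFS. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```

Alternatively, fs.default.name can be left pointing at HDFS and individual jobs can pass explicit file:/// URLs for their input and output paths, as Yong describes earlier in the thread.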
>>
>> regards
>> rab

-- 
Harsh J
