When using HBase, prefer the new (org.apache.hadoop.mapreduce) API.

Note, however, that the mapred.* package upstream in Hadoop is no longer deprecated.
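For reference, a minimal sketch of an HBase-to-HDFS job using only the new API, assembled from the working pieces later in this thread. This is a sketch only: ReadWriteMapper, the "users" table, and the hdfs://master:54310 URI are taken from the thread's example and are placeholders for your own mapper, table, and namenode address.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// New-API FileOutputFormat -- takes a Job, not a JobConf:
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HBaseToHdfsDriver {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "Hbase_Read_Write");
    job.setJarByClass(HBaseToHdfsDriver.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch more rows per RPC for MR scans
    scan.setCacheBlocks(false);  // don't pollute the region server block cache

    // Input comes from the HBase table, not from an input directory,
    // so no input path is set on the job.
    TableMapReduceUtil.initTableMapperJob(
        "users", scan, ReadWriteMapper.class,
        Text.class, IntWritable.class, job);

    // Output goes to HDFS; a fully qualified URI avoids picking up the
    // wrong default filesystem from the HBase configuration.
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job,
        new Path("hdfs://master:54310/MR/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The key points, as discussed below, are that TableMapReduceUtil only works with the new API, and that FileOutputFormat must therefore be the org.apache.hadoop.mapreduce.lib.output version (Job-based), not the org.apache.hadoop.mapred one (JobConf-based).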

On 22-Nov-2011, at 1:21 AM, Denis Kreis wrote:

> Hi
> 
> Is org.apache.hadoop.mapred.FileInputFormat considered
> obsolete/deprecated?
> 
> Thanks!
> 
> 2011/11/15 Stuti Awasthi <[email protected]>
> 
>> Sure Doug,
>> Thanks
>> 
>> -----Original Message-----
>> From: Doug Meil [mailto:[email protected]]
>> Sent: Monday, November 14, 2011 9:08 PM
>> To: [email protected]
>> Subject: Re: MR - Input from Hbase output to HDFS
>> 
>> 
>> Glad you worked through that and everything is working.  I will add an
>> MR HBase-to-HDFS example to the book.
>> 
>> 
>> 
>> 
>> 
>> On 11/14/11 1:24 AM, "Stuti Awasthi" <[email protected]> wrote:
>> 
>>> Hi,
>>> I think the issue is with the filesystem configuration: the config is
>>> picking up HBaseConfiguration. When I modified my output directory path
>>> to the absolute HDFS path:
>>> FileOutputFormat.setOutputPath(job, new
>>> Path("hdfs://master:54310/MR/stuti3"));
>>> 
>>> The MR job runs successfully and I am able to see the stuti3 directory
>>> inside HDFS at the desired path.
>>> 
>>> 
>>> -----Original Message-----
>>> From: Stuti Awasthi
>>> Sent: Monday, November 14, 2011 11:40 AM
>>> To: [email protected]
>>> Subject: RE: MR - Input from Hbase output to HDFS
>>> 
>>> Hi Joey,
>>> Thanks for pointing this out. After importing "FileOutputFormat" as you
>>> suggested, I am able to run the MR job from Eclipse (Windows); the only
>>> problem is I am not able to see the output directory this code is
>>> creating. HDFS and HBase are on a Linux machine.
>>> 
>>> Code :
>>>              Configuration config = HBaseConfiguration.create();
>>>              config.set("hbase.zookeeper.quorum", "master");
>>>              config.set("hbase.zookeeper.property.clientPort", "2181");
>>> 
>>>              Job job = new Job(config, "Hbase_Read_Write");
>>>              job.setJarByClass(ReadWriteDriver.class);
>>>              Scan scan = new Scan();
>>>              scan.setCaching(500);
>>>              scan.setCacheBlocks(false);
>>>              TableMapReduceUtil.initTableMapperJob("users",
>>> scan,ReadWriteMapper.class, Text.class, IntWritable.class, job);
>>>              job.setOutputFormatClass(TextOutputFormat.class);
>>>              FileOutputFormat.setOutputPath(job, new Path("/stuti2"));
>>> 
>>> After executing this code, the MR job runs successfully, but when I
>>> look in HDFS no "/stuti2" directory is created. I also looked in the
>>> local filesystem of the Linux machine as well as the Windows machine,
>>> but cannot find the output folder anywhere.
>>> 
>>> Eclipse console Output :
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.version=1.6.0_27
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.vendor=Sun Microsystems Inc.
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.home=C:\Program Files\Java\jdk1.6.0_27\jre
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.class.path=D:\workspace\Hbase\MRHbaseReadWrite\bin;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-cli-1.2.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-httpclient-3.0.1.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\commons-logging-1.0.4.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\hadoop-0.20.2-core.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\hbase-0.90.3.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\log4j-1.2.15.jar;
>>> D:\workspace\Hbase\MRHbaseReadWrite\lib\zookeeper-3.3.2.jar
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.library.path=C:\Program Files\Java\jdk1.6.0_27\jre\bin;
>>> C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;
>>> C:/Program Files/Java/jre6/bin/client;C:/Program Files/Java/jre6/bin;
>>> C:/Program Files/Java/jre6/lib/i386;C:\Windows\system32;C:\Windows;
>>> C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;
>>> C:\Program Files\Java\jdk1.6.0_27;C:\Program Files\TortoiseSVN\bin;
>>> C:\cygwin\bin;D:\apache-maven-3.0.3\bin;D:\eclipse;;.
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.io.tmpdir=C:\Users\STUTIA~1\AppData\Local\Temp\
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:java.compiler=<NA>
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.name=Windows 7
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.arch=x86
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:os.version=6.1
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.name=stutiawasthi
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.home=C:\Users\stutiawasthi
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Client
>>> environment:user.dir=D:\workspace\Hbase\MRHbaseReadWrite
>>> 11/11/14 11:21:45 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:45 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ec, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO mapred.JobClient: Running job: job_local_0001
>>> 11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ed, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO zookeeper.ZooKeeper: Initiating client
>>> connection,
>>> connectString=master:2181 sessionTimeout=180000 watcher=hconnection
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to server master/10.33.64.235:2181
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Socket connection
>>> established to master/10.33.64.235:2181, initiating session
>>> 11/11/14 11:21:46 INFO zookeeper.ClientCnxn: Session establishment
>>> complete on server master/10.33.64.235:2181, sessionid =
>>> 0x33879243de00ee, negotiated timeout = 180000
>>> 11/11/14 11:21:46 INFO mapred.MapTask: io.sort.mb = 100
>>> 11/11/14 11:21:46 INFO mapred.MapTask: data buffer = 79691776/99614720
>>> 11/11/14 11:21:46 INFO mapred.MapTask: record buffer = 262144/327680
>>> ...............................................
>>> 11/11/14 11:21:46 INFO mapred.MapTask: Finished spill 0
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner:
>>> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
>>> commiting
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> 'attempt_local_0001_m_000000_0' done.
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.Merger: Merging 1 sorted segments
>>> 11/11/14 11:21:46 INFO mapred.Merger: Down to the last merge-pass, with
>>> 1 segments left of total size: 103 bytes
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner:
>>> Task:attempt_local_0001_r_000000_0 is done. And is in the process of
>>> commiting
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner:
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> attempt_local_0001_r_000000_0 is allowed to commit now
>>> 11/11/14 11:21:46 INFO output.FileOutputCommitter: Saved output of task
>>> 'attempt_local_0001_r_000000_0' to /stuti2
>>> 11/11/14 11:21:46 INFO mapred.LocalJobRunner: reduce > reduce
>>> 11/11/14 11:21:46 INFO mapred.TaskRunner: Task
>>> 'attempt_local_0001_r_000000_0' done.
>>> 11/11/14 11:21:47 INFO mapred.JobClient:  map 100% reduce 100%
>>> 11/11/14 11:21:47 INFO mapred.JobClient: Job complete: job_local_0001
>>> 11/11/14 11:21:47 INFO mapred.JobClient: Counters: 12
>>> 11/11/14 11:21:47 INFO mapred.JobClient:   FileSystemCounters
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_READ=40923
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=82343
>>> 11/11/14 11:21:47 INFO mapred.JobClient:   Map-Reduce Framework
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input groups=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Combine output records=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map input records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce output records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Spilled Records=10
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map output bytes=91
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Combine input records=0
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Map output records=5
>>> 11/11/14 11:21:47 INFO mapred.JobClient:     Reduce input records=5
>>> 
>>> 
>>> Please Suggest
>>> 
>>> -----Original Message-----
>>> From: Joey Echeverria [mailto:[email protected]]
>>> Sent: Friday, November 11, 2011 10:38 PM
>>> To: [email protected]
>>> Subject: Re: MR - Input from Hbase output to HDFS
>>> 
>>> There are two APIs (old and new), and you appear to be mixing them.
>>> TableMapReduceUtil only works with the new API. The solution is to
>>> import the new version of FileOutputFormat, which takes a Job:
>>> 
>>> 
>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>> 
>>> -Joey
>>> 
>>> On Fri, Nov 11, 2011 at 12:55 AM, Stuti Awasthi <[email protected]>
>>> wrote:
>>>> The method " setOutputPath (JobConf,Path)" take JobConf as a
>>>> parameter not the Job object.
>>>> At least this is the error Im getting while compiling with Hadoop
>>>> 0.20.2 jar with eclipse.
>>>> 
>>>> FileOutputFormat.setOutputPath(conf, new Path("/output"));
>>>> 
>>>> -----Original Message-----
>>>> From: Prashant Sharma [mailto:[email protected]]
>>>> Sent: Friday, November 11, 2011 11:20 AM
>>>> To: [email protected]
>>>> Subject: Re: MR - Input from Hbase output to HDFS
>>>> 
>>>> Hi Stuti,
>>>> I was wondering why you are not using the Job object to set the output
>>>> path, like this:
>>>> 
>>>> FileOutputFormat.setOutputPath(job, new Path("outputReadWrite") );
>>>> 
>>>> 
>>>> thanks
>>>> 
>>>> On Fri, Nov 11, 2011 at 10:43 AM, Stuti Awasthi
>>>> <[email protected]>wrote:
>>>> 
>>>>> Hi Andrei,
>>>>> Well, I am a bit confused. When I use JobConf and associate it with
>>>>> JobClient to run the job, I get the error "Input directory
>>>>> is not set".
>>>>> I want my input to be taken from the HBase table, which I already
>>>>> configured with "TableMapReduceUtil.initTableMapperJob", so I don't
>>>>> want to set an input directory via JobConf.
>>>>> How do I mix these two so that I can get input from HBase and write
>>>>> output to HDFS?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Andrei Cojocaru [mailto:[email protected]]
>>>>> Sent: Thursday, November 10, 2011 7:09 PM
>>>>> To: [email protected]
>>>>> Subject: Re: MR - Input from Hbase output to HDFS
>>>>> 
>>>>> Stuti,
>>>>> 
>>>>> I don't see you associating JobConf with Job anywhere.
>>>>> -Andrei
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>> 
>> 
>> 
>> 