Spot on - thanks Billie, that did the trick!
On Thu, Jan 17, 2013 at 1:57 PM, Billie Rinaldi <[email protected]> wrote:

> On Thu, Jan 17, 2013 at 11:16 AM, Mike Hugo <[email protected]> wrote:
>
>> Thanks Billie!
>>
>> Setting "mapred.job.tracker" and "fs.default.name" in the conf has
>> gotten me further.
>>
>> job.getConfiguration().set("mapred.job.tracker", "server_name_here:8021");
>> job.getConfiguration().set("fs.default.name", "hdfs://server_name_here:8020");
>>
>> What's interesting now is that the job can't find Accumulo classes - when
>> I run the job now, I get
>>
>> 2013-01-17 12:59:25,278 [main] INFO mapred.JobClient - Task Id :
>> attempt_201301171102_0012_m_000000_1, Status : FAILED
>> java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat
>>
>> Is there a way to inform the job (via the Job API, on a separate machine
>> not running hadoop) about extra libs to include on the classpath of the job?
>
> You normally inform a job about jars it needs by specifying "-libjars
> comma,separated,jar,list" on the command line. In this case, you need to
> put those two strings "-libjars" and "jar,list" in the String[] args passed
> to ToolRunner.run:
>
> ToolRunner.run(CachedConfiguration.getInstance(), new ...(), args)
>
> The accumulo-core jar probably isn't the only one you'll need.
>
> Billie
>
>> Thanks
>>
>> Mike
>>
>> On Wed, Jan 16, 2013 at 3:11 PM, Billie Rinaldi <[email protected]> wrote:
>>
>>> Your job is running in "local" mode (Running job: job_local_0001). This
>>> basically means that the hadoop configuration is not present on the
>>> classpath of the java client kicking off the job. If you weren't planning
>>> to have the hadoop config on that machine, you might be able to get away
>>> with setting "mapred.job.tracker" and probably also "fs.default.name"
>>> on the Configuration object.
>>>
>>> Billie
>>>
>>> On Wed, Jan 16, 2013 at 12:07 PM, Mike Hugo <[email protected]> wrote:
>>>
>>>> Cool, thanks for the feedback John, the examples have been helpful in
>>>> getting up and running!
>>>>
>>>> Perhaps I'm not doing something quite right. When I jar up my jobs and
>>>> deploy the jar to the server and run it via the tool.sh command on the
>>>> cluster, I see the job running in the jobtracker (servername:50030) and it
>>>> runs as I would expect.
>>>>
>>>> 13/01/16 14:39:53 INFO mapred.JobClient: Running job: job_201301161326_0006
>>>> 13/01/16 14:39:54 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 13/01/16 14:41:29 INFO mapred.JobClient:  map 50% reduce 0%
>>>> 13/01/16 14:41:35 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 13/01/16 14:41:40 INFO mapred.JobClient: Job complete: job_201301161326_0006
>>>> 13/01/16 14:41:40 INFO mapred.JobClient: Counters: 18
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:   Job Counters
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=180309
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Rack-local map tasks=2
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Launched map tasks=2
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:   File Output Format Counters
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Bytes Written=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:   FileSystemCounters
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     HDFS_BYTES_READ=248
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=60214
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:   File Input Format Counters
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Bytes Read=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Map input records=1036434
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Physical memory (bytes) snapshot=373760000
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Spilled Records=0
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     CPU time spent (ms)=24410
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Total committed heap usage (bytes)=168394752
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2124627968
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     Map output records=2462684
>>>> 13/01/16 14:41:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=248
>>>>
>>>> When I kick off a job via a java client running on a different host,
>>>> the job seems to run (I can see things being scanned and ingested) but I
>>>> don't see anything via the jobtracker UI on the server. Is that normal?
>>>> Or do I have something mis-configured?
>>>>
>>>> Here's how I'm starting things from the client:
>>>>
>>>> @Override
>>>> public int run(String[] strings) throws Exception {
>>>>     Job job = new Job(getConf(), getClass().getSimpleName());
>>>>     job.setJarByClass(getClass());
>>>>     job.setMapperClass(MyMapper.class);
>>>>
>>>>     job.setInputFormatClass(AccumuloRowInputFormat.class);
>>>>     AccumuloRowInputFormat.setZooKeeperInstance(job.getConfiguration(),
>>>>         instanceName, zookeepers);
>>>>     AccumuloRowInputFormat.setInputInfo(job.getConfiguration(),
>>>>         username,
>>>>         password.getBytes(),
>>>>         "...",
>>>>         new Authorizations());
>>>>
>>>>     job.setNumReduceTasks(0);
>>>>
>>>>     job.setOutputFormatClass(AccumuloOutputFormat.class);
>>>>     job.setOutputKeyClass(Key.class);
>>>>     job.setOutputValueClass(Mutation.class);
>>>>
>>>>     boolean createTables = true;
>>>>     String defaultTable = "...";
>>>>     AccumuloOutputFormat.setOutputInfo(job.getConfiguration(),
>>>>         username,
>>>>         password.getBytes(),
>>>>         createTables,
>>>>         defaultTable);
>>>>
>>>>     AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(),
>>>>         instanceName, zookeepers);
>>>>
>>>>     job.waitForCompletion(true);
>>>>
>>>>     return job.isSuccessful() ? 0 : 1;
>>>> }
>>>>
>>>> public static void main(String args[]) throws Exception {
>>>>     int res = ToolRunner.run(CachedConfiguration.getInstance(), new ...(), args);
>>>>     System.exit(res);
>>>> }
>>>>
>>>> Here's the output when I run it via the client application:
>>>>
>>>> 2013-01-16 13:55:57,645 [main-SendThread()] INFO zookeeper.ClientCnxn - Opening socket connection to server accumulo/10.1.10.160:2181
>>>> 2013-01-16 13:55:57,660 [main-SendThread(accumulo:2181)] INFO zookeeper.ClientCnxn - Socket connection established to accumulo/10.1.10.160:2181, initiating session
>>>> 2013-01-16 13:55:57,671 [main-SendThread(accumulo:2181)] INFO zookeeper.ClientCnxn - Session establishment complete on server accumulo/10.1.10.160:2181, sessionid = 0x13c449cfe010434, negotiated timeout = 30000
>>>> 2013-01-16 13:55:58,379 [main] INFO mapred.JobClient - Running job: job_local_0001
>>>> 2013-01-16 13:55:58,447 [Thread-16] INFO mapred.Task - Using ResourceCalculatorPlugin : null
>>>> 2013-01-16 13:55:59,383 [main] INFO mapred.JobClient -  map 0% reduce 0%
>>>> 2013-01-16 13:56:04,458 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:07,459 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:10,461 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:13,462 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:16,463 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:19,465 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:21,783 [Thread-16] INFO mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
>>>> 2013-01-16 13:56:21,783 [Thread-16] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:21,784 [Thread-16] INFO mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
>>>> 2013-01-16 13:56:21,786 [Thread-16] INFO mapred.Task - Using ResourceCalculatorPlugin : null
>>>> 2013-01-16 13:56:22,423 [main] INFO mapred.JobClient -  map 100% reduce 0%
>>>> 2013-01-16 13:56:27,788 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:28,440 [main] INFO mapred.JobClient -  map 50% reduce 0%
>>>> 2013-01-16 13:56:30,790 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:33,791 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:36,792 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:39,793 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:42,794 [communication thread] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:45,779 [Thread-16] INFO mapred.Task - Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
>>>> 2013-01-16 13:56:45,780 [Thread-16] INFO mapred.LocalJobRunner -
>>>> 2013-01-16 13:56:45,781 [Thread-16] INFO mapred.Task - Task 'attempt_local_0001_m_000001_0' done.
>>>> 2013-01-16 13:56:45,782 [Thread-16] WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>> 2013-01-16 13:56:46,462 [main] INFO mapred.JobClient -  map 100% reduce 0%
>>>> 2013-01-16 13:56:46,462 [main] INFO mapred.JobClient - Job complete: job_local_0001
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient - Counters: 7
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -   FileSystemCounters
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     FILE_BYTES_READ=1257
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     FILE_BYTES_WRITTEN=106136
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -   Map-Reduce Framework
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     Map input records=1036434
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     Spilled Records=0
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     Total committed heap usage (bytes)=259915776
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     Map output records=2462684
>>>> 2013-01-16 13:56:46,463 [main] INFO mapred.JobClient -     SPLIT_RAW_BYTES=240
>>>>
>>>> On Wed, Jan 16, 2013 at 11:20 AM, John Vines <[email protected]> wrote:
>>>>
>>>>> The code examples we have scripted simply do the necessary setup for
>>>>> creating a mapreduce job and kicking it off. If you check out the code for
>>>>> them in
>>>>> src/examples/simple/src/main/java/org/apache/accumulo/examples/simple/mapreduce/
>>>>> you can see what we're doing in Java to kick off jobs.
>>>>>
>>>>> The short explanation is, just like any other MapReduce job, we're
>>>>> setting up a Job, configuring the AccumuloInput and/or OutputFormats, and
>>>>> sending them off like any other MapReduce job.
>>>>>
>>>>> John
>>>>>
>>>>> On Wed, Jan 16, 2013 at 12:11 PM, Mike Hugo <[email protected]> wrote:
>>>>>
>>>>>> I'm writing a client program that uses the BatchWriter and
>>>>>> BatchScanner for inserting and querying data, but occasionally it also
>>>>>> needs to be able to kick off a Map/Reduce job on a remote accumulo cluster.
>>>>>> The Map/Reduce examples that ship with Accumulo look like they are meant
>>>>>> to be invoked via the command line. Does anyone have an example of how to
>>>>>> kick something off via a java client running on a separate server? Any
>>>>>> best practices to share?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike
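[Editor's note] Billie's "-libjars" fix from this thread can be sketched in plain Java: the two extra strings are simply prepended to the args array before it reaches ToolRunner.run. The jar paths below are placeholders (the thread doesn't name the exact jars beyond accumulo-core), and the ToolRunner/CachedConfiguration call is shown only as a comment, since it needs Hadoop on the classpath and a cluster to submit to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LibJarsArgs {

    // Prepend "-libjars" and a comma-separated jar list to the args that
    // will be handed to ToolRunner.run; Hadoop's GenericOptionsParser
    // consumes these options and ships the jars with the job.
    static String[] withLibJars(String[] args, String... jars) {
        List<String> out = new ArrayList<String>();
        out.add("-libjars");
        out.add(String.join(",", jars));
        out.addAll(Arrays.asList(args));
        return out.toArray(new String[0]);
    }

    public static void main(String[] argv) throws Exception {
        // Placeholder jar paths -- point these at the real client-side jars.
        String[] args = withLibJars(argv,
                "/opt/accumulo/lib/accumulo-core.jar",
                "/opt/accumulo/lib/libthrift.jar");
        System.out.println(Arrays.toString(args));
        // Then, exactly as in the thread:
        // int res = ToolRunner.run(CachedConfiguration.getInstance(), new ...(), args);
        // System.exit(res);
    }
}
```

As Billie notes, accumulo-core is probably not the only jar the tasks will need; anything the mappers reference (thrift, zookeeper, etc.) has to be in the list too.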
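[Editor's note] The other fix in this thread, setting "mapred.job.tracker" and "fs.default.name", can also be done by putting Hadoop config files on the client's classpath instead of calling Configuration.set in code. A sketch of the equivalent entries, reusing the placeholder host and ports quoted above:

```xml
<!-- core-site.xml on the client (host is a placeholder) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://server_name_here:8020</value>
</property>

<!-- mapred-site.xml on the client -->
<property>
  <name>mapred.job.tracker</name>
  <value>server_name_here:8021</value>
</property>
```

With these visible to the client JVM, the job submits to the remote jobtracker rather than falling back to local mode (job_local_0001), which is the symptom Billie diagnosed.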
