Re: Copying to DistributedCache using -files

Miguel Paraz Thu, 19 Dec 2013 20:28:11 -0800

Hi Josh,

It's working now. Thanks for helping with my newbie question, and for
looking at the code.


It's confusing that the plain MapReduce API works even when the
new Configuration() argument is omitted.
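
For anyone reading this thread later, here is a toy sketch of the idea behind the fix: the "-files" option has to be pulled out of the command-line args and recorded in the job configuration (under "tmpfiles", the key logged in the original question) before the job is submitted. This is a simplified stand-in written against plain Java collections, not Hadoop's actual option parsing, which does more work than this. The class name GenericOptionsSketch, the parse() helper, and the file name GeoIP.dat are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Hadoop's option parser, just an illustration.
final class GenericOptionsSketch {

    // Moves a "-files <list>" option out of args into conf under "tmpfiles"
    // and returns the arguments that are left over for the application.
    static String[] parse(String[] args, Map<String, String> conf) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-files".equals(args[i]) && i + 1 < args.length) {
                conf.put("tmpfiles", args[++i]); // consume the option's value
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Hypothetical command line: one cached file plus two job arguments.
        String[] appArgs =
                parse(new String[] {"-files", "GeoIP.dat", "input", "output"}, conf);
        System.out.println("tmpfiles = " + conf.get("tmpfiles"));
        System.out.println("application args = " + String.join(" ", appArgs));
    }
}
```

The point of the sketch is that the job only sees the file list if it is built from the same configuration object the options were parsed into.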

Cheers,
Miguel


On Fri, Dec 20, 2013 at 3:38 AM, Josh Wills <[email protected]> wrote:

> Hey Miguel,
>
> You need to call:
>
> ToolRunner.run(new Configuration(), new MaxmindCrunchJob(), args);
>
> in main() to pick up the args from the command line.
>
> J
>
>
> On Thu, Dec 19, 2013 at 8:42 AM, Miguel Paraz <[email protected]> wrote:
>
>> Hi,
>> I'm studying Crunch with code that relies on the DistributedCache to copy
>> files to the local filesystem. (My code is at
>> https://bitbucket.org/mparaz/maxmind-crunch)
>>
>> I'm using 0.9.0-mapreduce2 on a 2.2.0 setup (Hortonworks Sandbox 2.0).
>>
>> I see that Crunch programs use the same pattern as low-level MapReduce,
>> with ToolRunner.run() and implementing Tool.run().
>>
>> Unfortunately, the file I specify with the "-files" parameter is not
>> copied.
>> I logged getConf().get("tmpfiles") and that configuration entry is there.
>>
>> At which point should the file be copied? I looked through the Hadoop source
>> code and found that tmpfiles is processed in
>> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
>> - copyAndConfigureFiles()
>>
>> Is this code not invoked when Crunch is used?
>> This works with the equivalent MapReduce 2.2.0 API code.
>>
>> Is there a working example with distributed files that I could try?
>>
>> Thanks!
>> Miguel
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
