Hi Prav,

You are correct, thanks for the explanation. As per the link below, I can see that Job's method internally calls DistributedCache itself, after ensuring the job state:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache/hadoop/mapreduce/Job.java#Job.addCacheFile%28java.net.URI%29

I think that might be the reason. Here is one of the methods:

    public void addCacheFile(URI uri) {
      ensureState(JobState.DEFINE);
      DistributedCache.addCacheFile(uri, conf);
    }

Thanks
Amit


On Thu, Jan 30, 2014 at 6:19 PM, praveenesh kumar <[email protected]> wrote:

> Hi Amit,
>
> Side data distribution is a different concept altogether. It is when
> you set custom (key, value) pairs and use the Job object for doing that, so that
> you can use them in your mappers/reducers. It is good when you want to pass
> some small information to your mappers/reducers, like extra command line
> arguments that are required by the mappers/reducers.
> We were not discussing side data distribution at all.
>
> The question was: DistributedCache is deprecated, so where can we find the
> right methods that DistributedCache delivers?
> If you see the DistributedCache class in MR v1 -
>
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html
>
> and compare it with the Job class in MR v2 -
>
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html
>
> you will see that the methods of the DistributedCache class have been added to the Job
> class. Since DistributedCache is deprecated, my guess was that we can use
> the Job class to use the distributed cache, with the same methods which
> DistributedCache used to provide.
>
> Everything else is the same, it's just that you use the Job class to set your files
> for the distributed cache inside your job configuration. Well, I am sorry. I
> don't have any nice article; as I said, I also did this as part of my
> experiment and I was able to use it without any issues, so that's why I
> suggested it.
>
> Since most developers are still using MRv1 on hadoop 2.0, these changes
> have not been highlighted so far. I am hoping new
> documentation on how to use MRv2 will come soon, but if you understand
> MRv1, I don't see any reason why you can't just move around a bit in the API
> and find the relevant classes that you want to use by yourself. Again, as
> I said, I don't have any authoritative source for what I am saying; these are
> just the results of my own experiments, which you are most welcome to
> conduct and play with. Happy Coding..!!
>
> Regards
> Prav
>
>
> On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal <[email protected]> wrote:
>
>> Hi Prav,
>>
>> Yes, you are correct that DistributedCache does not upload files into
>> memory. Also, using job configuration and DistributedCache are 2 different
>> approaches. I am referring to "Hadoop: The definitive guide"
>> Chapter:8 > Side Data Distribution (Page 288-295).
>> As you are saying that the methods of DistributedCache have now moved to Job, I
>> request that you please share some article or document on that for my better
>> understanding; it will be a great help.
>>
>> Thanks
>> Amit
>>
>>
>> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar
>> <[email protected]> wrote:
>>
>>> Hi Amit,
>>>
>>> I am not sure how they are linked with DistributedCache. Job
>>> configuration is not uploading any data into memory. As far as I am aware of
>>> how DistributedCache works, nothing gets loaded into memory. Distributed cache
>>> just copies the files onto the slave nodes, so that they are accessible to
>>> mappers/reducers. Usually the location is
>>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
>>> distribution to distribution). You always have to read the files in your
>>> mapper or reducer whenever you want to use them.
>>>
>>> What has happened is that the methods of the DistributedCache class have now been
>>> added to the Job class, and I am assuming they won't change the functionality
>>> of how the distributed cache methods used to work, otherwise there would have
>>> been some nice articles on that, plus I don't see any reason for changing
>>> that either. So everything still works the same way. It's just that
>>> you use the new Job class to use distributed cache features.
>>>
>>> I am not sure what entries you are exactly pointing to. Am I missing
>>> anything here?
>>>
>>>
>>> Regards
>>> Prav
>>>
>>>
>>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal <[email protected]> wrote:
>>>
>>>> Hi Mike & Prav,
>>>>
>>>> Although I am new to Hadoop, I would like to add my 2 cents if that
>>>> helps.
>>>> We have 2 ways of distributing shared data: one is using the Job
>>>> configuration and the other is DistributedCache.
>>>> The job configuration is read by the JT, TT and child JVMs, and each
>>>> time the configuration is read, all of its entries are read into memory, even
>>>> if they are not used.
>>>> So using job configuration is not advised if the data
>>>> is more than a few kilobytes. So it is not an alternative to DistributedCache
>>>> unless some modifications are made in Job configuration to address this
>>>> limitation.
>>>> So I am also curious to know the alternative to the DistributedCache
>>>> class.
>>>>
>>>> Thanks
>>>> Amit
>>>>
>>>>
>>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>>> [email protected]> wrote:
>>>>
>>>>> I noticed that in Hadoop 2.2.0
>>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been
>>>>> deprecated.
>>>>>
>>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>>
>>>>> Is there a class that provides equivalent functionality? My
>>>>> application relies heavily on DistributedCache.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike G.
>>>>>
>>>>> This communication, along with its attachments, is considered
>>>>> confidential and proprietary to Vistronix. It is intended only for the
>>>>> use of the person(s) named above. Note that unauthorized disclosure or
>>>>> distribution of information not generally known to the public is strictly
>>>>> prohibited. If you are not the intended recipient, please notify the
>>>>> sender immediately.
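
For anyone without a Hadoop build at hand, the delegation in the Job.addCacheFile method quoted at the top of this message can be modelled in plain Java. This is only a rough sketch and not Hadoop code: SketchJob, SketchDistributedCache, the map standing in for Configuration, and the submit() method are all made up for illustration, and the property name is an assumption about where cache files get recorded, not something shown in this thread.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Toy model of "check job state, then delegate to DistributedCache". Not Hadoop code. */
final class CacheDelegationSketch {

    // Assumed property name, used here only as a map key for illustration.
    static final String CACHE_FILES_KEY = "mapreduce.job.cache.files";

    enum JobState { DEFINE, RUNNING }

    /** Hypothetical stand-in for DistributedCache: appends the URI to a comma-separated entry. */
    static final class SketchDistributedCache {
        static void addCacheFile(URI uri, Map<String, String> conf) {
            String files = conf.get(CACHE_FILES_KEY);
            conf.put(CACHE_FILES_KEY, files == null ? uri.toString() : files + "," + uri);
        }
    }

    /** Hypothetical stand-in for Job: only accepts cache files while still being defined. */
    static final class SketchJob {
        private final Map<String, String> conf = new HashMap<>();
        private JobState state = JobState.DEFINE;

        void addCacheFile(URI uri) {
            ensureState(JobState.DEFINE);                    // state check comes first
            SketchDistributedCache.addCacheFile(uri, conf);  // then the actual work is delegated
        }

        /** Simplified: a real submission does far more than flip a flag. */
        void submit() {
            state = JobState.RUNNING;
        }

        String cacheFiles() {
            return conf.get(CACHE_FILES_KEY);
        }

        private void ensureState(JobState expected) {
            if (state != expected) {
                throw new IllegalStateException(
                        "Job in state " + state + " instead of " + expected);
            }
        }
    }

    public static void main(String[] args) {
        SketchJob job = new SketchJob();
        job.addCacheFile(URI.create("hdfs:///data/lookup.txt"));
        job.addCacheFile(URI.create("hdfs:///data/stopwords.txt"));
        System.out.println(job.cacheFiles());

        job.submit();
        try {
            job.addCacheFile(URI.create("hdfs:///data/late.txt"));
        } catch (IllegalStateException e) {
            System.out.println("rejected after submit: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the ordering: the state check runs before anything is written to the configuration, so in this model a file added after submission is rejected instead of being silently recorded.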
