Re: DistributedCache deprecated

praveenesh kumar Thu, 30 Jan 2014 04:50:15 -0800

Hi Amit,

Side data distribution is a different concept altogether. It is when you
set custom (key, value) pairs in the job configuration, through the Job
object, so that you can read them in your mappers/reducers. It is good when
you want to pass some small piece of information to your mappers/reducers,
like extra command-line arguments that they require.
We were not discussing side data distribution at all.
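
To make the distinction concrete, here is a minimal sketch of passing a small
value through the job configuration. The class names, the property name
"example.min.length" and the argument layout are made up for illustration;
only the Configuration, Job and Mapper calls are from the Hadoop API.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example class: keeps only input lines of a minimum length.
public class SideDataExample {

  public static class MinLengthMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private int minLength;

    @Override
    protected void setup(Context context) {
      // Read the small value back from the job configuration.
      // "example.min.length" is an assumed property name; 0 is the default.
      minLength = context.getConfiguration().getInt("example.min.length", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Text.getLength() is the length of the line in bytes.
      if (value.getLength() >= minLength) {
        context.write(value, NullWritable.get());
      }
    }
  }

  // Assumed usage: SideDataExample <input> <output> <minLength>
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Set the value before creating the Job: the Job takes its own copy
    // of the configuration when it is created.
    conf.setInt("example.min.length", Integer.parseInt(args[2]));

    Job job = Job.getInstance(conf, "side-data-example");
    job.setJarByClass(SideDataExample.class);
    job.setMapperClass(MinLengthMapper.class);
    job.setNumReduceTasks(0); // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

This only suits values of a few bytes or kilobytes, since the whole
configuration travels with the job; files belong in the distributed cache.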


The question was: now that DistributedCache is deprecated, where can we find
the methods that DistributedCache used to provide?
If you look at the DistributedCache class in MR v1 -
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/filecache/DistributedCache.html

and compare it with Job class in MR v2 -
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html

You will see that the methods of the DistributedCache class have been added
to the Job class. Since DistributedCache is deprecated, my guess was that we
can use the Job class to work with the distributed cache, using the same
methods that DistributedCache used to provide.

Everything else is the same; it's just that you use the Job class to set your
distributed cache files as part of your job configuration. I am sorry, I
don't have any nice article to share. As I said, I did this as part of my
own experiment and was able to use it without any issues, which is why I
suggested it.
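
For reference, a minimal sketch of that approach, assuming Hadoop 2.x and the
new (org.apache.hadoop.mapreduce) API. Job.addCacheFile(URI) and
Context.getCacheFiles() are the Job-side counterparts of the old
DistributedCache calls; the class names, the HDFS path /shared/lookup.tsv,
the "#lookup" link name and the tab-separated file layout are all made up
for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example class: joins each input line against a cached lookup file.
public class CacheFileExample {

  public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
      // URIs registered in the driver with job.addCacheFile(...).
      URI[] cacheFiles = context.getCacheFiles();
      if (cacheFiles == null || cacheFiles.length == 0) {
        return; // nothing was cached; the lookup table stays empty
      }
      // Assumption: the "#lookup" fragment used in the driver makes the
      // framework link the local copy as "lookup" in the task's working
      // directory. The file still has to be read explicitly here.
      try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2); // assumed layout: key<TAB>value
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String match = lookup.get(value.toString());
      if (match != null) {
        context.write(value, new Text(match));
      }
    }
  }

  // Assumed usage: CacheFileExample <input> <output>
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-file-example");
    job.setJarByClass(CacheFileExample.class);
    job.setMapperClass(LookupMapper.class);
    job.setNumReduceTasks(0); // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Replaces the deprecated DistributedCache.addCacheFile(uri, conf).
    job.addCacheFile(new URI("/shared/lookup.tsv#lookup"));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

I have not checked how the link name behaves on every distribution or with
the local job runner, so treat the file-reading part as something to verify
on your own cluster.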

Since most developers are still using MRv1 on Hadoop 2.0, these changes have
not been highlighted much so far. I am hoping new documentation on how to
use MRv2 will come soon, but if you understand MRv1, I don't see any reason
why you can't look around the API a bit and find the relevant classes
yourself. Again, as I said, I don't have any authoritative source for what I
am saying; these are just the results of my own experiments, which you are
most welcome to repeat and play with. Happy Coding..!!

Regards
Prav




On Thu, Jan 30, 2014 at 12:27 PM, Amit Mittal <[email protected]> wrote:

> Hi Prav,
>
> Yes, you are correct that DistributedCache does not load files into
> memory. Also, using the job configuration and using DistributedCache are
> two different approaches. I am going by "Hadoop: The Definitive Guide",
> Chapter 8 > Side Data Distribution (pages 288-295).
> Since you say that the methods of DistributedCache have now moved to Job,
> could you please share an article or document on that for my better
> understanding? It would be a great help.
>
> Thanks
> Amit
>
>
> On Thu, Jan 30, 2014 at 5:35 PM, praveenesh kumar <[email protected]>wrote:
>
>> Hi Amit,
>>
>> I am not sure how they are linked with DistributedCache. Job
>> configuration does not upload any data into memory. As far as I am aware
>> of how DistributedCache works, nothing gets loaded into memory. The
>> distributed cache just copies the files to the slave nodes, so that they
>> are accessible to mappers/reducers. Usually the location is
>> ${hadoop.tmp.dir}/${mapred.local.dir}/tasktracker/archive (it varies from
>> distribution to distribution). You always have to read the files yourself
>> in your mapper or reducer whenever you want to use them.
>>
>> What has happened is that the methods of the DistributedCache class have
>> now been added to the Job class, and I am assuming they won't change how
>> the distributed cache methods work; otherwise there would have been some
>> nice articles on that. I also don't see any reason to change that, so
>> everything still works the same way. It's just that you use the new Job
>> class to use the distributed cache features.
>>
>> I am not sure exactly which entries you are pointing to. Am I missing
>> anything here?
>>
>>
>> Regards
>> Prav
>>
>>
>> On Thu, Jan 30, 2014 at 6:12 AM, Amit Mittal <[email protected]>wrote:
>>
>>> Hi Mike & Prav,
>>>
>>> Although I am new to Hadoop, I would like to add my 2 cents in case it
>>> helps.
>>> We have 2 ways to distribute shared data: one is the Job configuration
>>> and the other is DistributedCache.
>>> The job configuration is read by the JT, TT and child JVMs, and each time
>>> the configuration is read, all of its entries are read into memory, even
>>> if they are not used. So using the job configuration is not advised if
>>> the data is more than a few kilobytes. It is therefore not an alternative
>>> to DistributedCache unless the Job configuration is modified to address
>>> this limitation.
>>> So I am also curious to know the alternative to the DistributedCache class.
>>>
>>> Thanks
>>> Amit
>>>
>>>
>>>
>>> On Thu, Jan 30, 2014 at 2:43 AM, Giordano, Michael <
>>> [email protected]> wrote:
>>>
>>>> I noticed that in Hadoop 2.2.0
>>>> org.apache.hadoop.mapreduce.filecache.DistributedCache has been deprecated.
>>>>
>>>>
>>>>
>>>> (http://hadoop.apache.org/docs/current/api/deprecated-list.html#class)
>>>>
>>>>
>>>>
>>>> Is there a class that provides equivalent functionality? My application
>>>> relies heavily on DistributedCache.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mike G.
>>>>
>>>> This communication, along with its attachments, is considered
>>>> confidential and proprietary to Vistronix.  It is intended only for the use
>>>> of the person(s) named above.  Note that unauthorized disclosure or
>>>> distribution of information not generally known to the public is strictly
>>>> prohibited.  If you are not the intended recipient, please notify the
>>>> sender immediately.
>>>>
>>>
>>>
>>
>
