In my suggestion, the map or reduce tasks do not use the distributed cache. They read the model file directly from HDFS using short-circuit local reads. It works like a shared-storage approach, but with a high replication factor almost every node holds the data on its own disk.
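As a rough sketch of this suggestion: short-circuit local reads are enabled in hdfs-site.xml (the socket path below is only an example; it must exist and be writable by the datanode), and the model file's replication factor is raised separately, e.g. with `hdfs dfs -setrep`.

```xml
<!-- hdfs-site.xml: enable short-circuit local reads
     (needed on both the client and the DataNode side) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

With that in place, something like `hdfs dfs -setrep -w 10 /path/to/model` (path and factor are illustrative) pushes a copy of the model's blocks onto most datanodes, so tasks read it locally.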
Drake 민영근 Ph.D

On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <[email protected]> wrote:

> But still, if the model is very large, how can we load it into the
> Distributed Cache or something like that?
> Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
> But it is confusing me.
>
> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <[email protected]> wrote:
>
>> Hi,
>>
>> How about this? The large model data stays in HDFS, but with many
>> replications, and the MapReduce program reads the model from HDFS. In
>> theory, the replication factor of the model data equals the number of
>> data nodes, and with the Short-Circuit Local Reads feature of the HDFS
>> datanode, the map or reduce tasks read the model data from their own
>> disks.
>>
>> This way may use a lot of HDFS capacity, but the annoying partition
>> problem will be gone.
>>
>> Thanks
>>
>> Drake 민영근 Ph.D
>>
>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <[email protected]>
>> wrote:
>>
>>> Is there any way?
>>> Waiting for a reply. I have posted the question everywhere, but no one
>>> is responding.
>>> I feel like this is the right place to ask doubts, as some of you may
>>> have come across the same issue and got stuck.
>>>
>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>> [email protected]> wrote:
>>>
>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>> whether it can somehow be done in Hadoop MapReduce as well, because I
>>>> found some papers on KNN on Hadoop too.
>>>> And I am trying to compare the performance too.
>>>>
>>>> Hope some pointers can help me.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <[email protected]>
>>>> wrote:
>>>>
>>>>> Have you considered implementing it using something like Spark? That
>>>>> could be much easier than raw map-reduce.
>>>>>
>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>> for predicting the records.
>>>>>>
>>>>>> Here is the example for KNN.
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>> to load it into the Distributed Cache.
>>>>>>
>>>>>> One way is to split/partition the model result into some files,
>>>>>> perform the distance calculation for all records in each file, and
>>>>>> then find the minimum distance and the most frequent class label to
>>>>>> predict the outcome.
>>>>>>
>>>>>> How can we partition the file and perform the operation on these
>>>>>> partitions?
>>>>>>
>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>
>>>>>> This is what came to my thought.
>>>>>>
>>>>>> Is there any further way?
>>>>>>
>>>>>> Any pointers would help me.
>>>>>>
>>>>>> --
>>>>>> *Thanks & Regards *
>>>>>>
>>>>>> *Unmesha Sreeveni U.B*
>>>>>> *Hadoop, Bigdata Developer*
>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>> http://www.unmeshasreeveni.blogspot.in/
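The partition scheme discussed in the quoted thread (compute distances against each model partition separately, then combine the per-partition candidates and majority-vote on the class label) can be sketched in plain Python, outside Hadoop. This is not Hadoop API code; `partial_knn` stands in for one map task's work on one partition and `merge_and_predict` for the reduce-side merge, and all names are made up for illustration.

```python
from collections import Counter
import heapq
import math

def partial_knn(test_point, model_partition, k):
    """Per-partition step (one 'map task'): return the k nearest
    (distance, label) pairs from this partition of the model."""
    dists = [(math.dist(test_point, feats), label)
             for feats, label in model_partition]
    return heapq.nsmallest(k, dists)

def merge_and_predict(partial_results, k):
    """Merge step (the 'reducer'): keep the globally k nearest
    candidates across all partitions, then majority-vote on labels."""
    nearest = heapq.nsmallest(
        k, (pair for partial in partial_results for pair in partial))
    votes = Counter(label for _dist, label in nearest)
    return votes.most_common(1)[0][0]

# Toy model split into two partitions: (features, class label) pairs
part1 = [((0.0, 0.0), "A"), ((1.0, 1.0), "A")]
part2 = [((5.0, 5.0), "B"), ((0.5, 0.0), "A"), ((4.0, 4.0), "B")]
query = (0.2, 0.1)

partials = [partial_knn(query, p, k=3) for p in (part1, part2)]
print(merge_and_predict(partials, k=3))  # prints "A"
```

The key point is that each partition only needs to emit its local top-k, which bounds the intermediate data per record regardless of how large the full model is.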
