But still, if the model is very large, how can we load it into the Distributed Cache or something like that? Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf, but it is confusing me.
On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <[email protected]> wrote:

> Hi,
>
> How about this? The large model data stays in HDFS, but with many
> replications, and the MapReduce program reads the model from HDFS. In
> theory, the replication factor of the model data equals the number of
> data nodes, and with the Short-Circuit Local Reads feature of the HDFS
> datanode, the map or reduce tasks read the model data from their own
> disks.
>
> This way may use a lot of HDFS storage, but the annoying partition
> problem will be gone.
>
> Thanks
>
> Drake 민영근 Ph.D
>
> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <[email protected]> wrote:
>
>> Is there any way? Waiting for a reply. I have posted the question
>> everywhere, but no one is responding. I feel like this is the right
>> place to ask doubts, as some of you may have come across the same issue
>> and gotten stuck.
>>
>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <[email protected]> wrote:
>>
>>> Yes, one of my friends is implementing the same thing. I know global
>>> sharing of data is not possible across Hadoop MapReduce, but I need to
>>> check whether it can be done somehow in Hadoop MapReduce as well,
>>> because I found some papers on KNN in Hadoop too. And I am trying to
>>> compare the performance as well.
>>>
>>> Hope some pointers can help me.
>>>
>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Have you considered implementing this using something like Spark?
>>>> That could be much easier than raw map-reduce.
>>>>
>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <[email protected]> wrote:
>>>>
>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>> for predicting the records.
>>>>>
>>>>> Here is the example for KNN.
>>>>>
>>>>> [image: Inline image 1]
>>>>>
>>>>> So if the model is a large file, say 1 or 2 GB, we will be able to
>>>>> load it into the Distributed Cache.
>>>>>
>>>>> One way is to split/partition the model result into some files,
>>>>> perform the distance calculation for all records against each file,
>>>>> and then find the minimum distance and the most frequent class label
>>>>> and predict the outcome.
>>>>>
>>>>> How can we partition the file and perform the operation on these
>>>>> partitions?
>>>>>
>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>
>>>>> This is what came to my thoughts.
>>>>>
>>>>> Is there any further way?
>>>>>
>>>>> Any pointers would help me.

--
*Thanks & Regards*

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
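
A minimal sketch of the approach Drake describes above: keep the model in HDFS with a high replication factor and have every map task read it once in setup(), relying on short-circuit local reads to serve the bytes from the local disk. The class name KnnHdfsModelMapper, the configuration keys "knn.model.path" and "knn.k", and the CSV layout of the model and test records are illustrative assumptions, not code from this thread.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnHdfsModelMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In-memory copy of the model; each row is a feature vector plus a label.
  private final List<double[]> modelFeatures = new ArrayList<>();
  private final List<String> modelLabels = new ArrayList<>();
  private int k;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    k = conf.getInt("knn.k", 5);
    Path modelPath = new Path(conf.get("knn.model.path"));

    // Read the model directly from HDFS. With the model replicated to every
    // datanode and short-circuit local reads enabled, this read is served
    // from the local disk of the node running the task.
    FileSystem fs = modelPath.getFileSystem(conf);
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split(",");
        double[] features = new double[parts.length - 1];
        for (int i = 0; i < features.length; i++) {
          features[i] = Double.parseDouble(parts[i]);
        }
        modelFeatures.add(features);
        modelLabels.add(parts[parts.length - 1]);
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Parse the test record (assumed to be a CSV of feature values only).
    String[] parts = value.toString().split(",");
    double[] query = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      query[i] = Double.parseDouble(parts[i]);
    }

    // Max-heap holding the k closest model rows seen so far: {distance, rowIndex}.
    PriorityQueue<double[]> nearest =
        new PriorityQueue<>(k, (a, b) -> Double.compare(b[0], a[0]));
    for (int i = 0; i < modelFeatures.size(); i++) {
      double[] row = modelFeatures.get(i);
      double dist = 0.0;
      for (int j = 0; j < row.length && j < query.length; j++) {
        double d = row[j] - query[j];
        dist += d * d;
      }
      nearest.offer(new double[] {dist, i});
      if (nearest.size() > k) {
        nearest.poll(); // drop the farthest candidate
      }
    }

    // Majority vote over the labels of the k nearest rows.
    Map<String, Integer> votes = new HashMap<>();
    for (double[] cand : nearest) {
      votes.merge(modelLabels.get((int) cand[1]), 1, Integer::sum);
    }
    String predicted = null;
    int best = -1;
    for (Map.Entry<String, Integer> e : votes.entrySet()) {
      if (e.getValue() > best) {
        best = e.getValue();
        predicted = e.getKey();
      }
    }
    context.write(value, new Text(predicted));
  }
}

With this arrangement there is no Distributed Cache size concern at all: the model never leaves HDFS, and each task pays only the cost of one local read plus the memory to hold the rows it loads.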

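For the split/partition idea in the original mail, one possible arrangement (an assumption, not something spelled out in the thread) is to make the model partitions the map input, give every map task access to the much smaller test set, and have each mapper emit, for every test record, its local k nearest "distance,label" candidates computed against its own partition. A reduce step keyed by test record id then merges the partial results. The class name KnnPartitionReducer, the key/value layout, and the "knn.k" configuration key are illustrative assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnPartitionReducer extends Reducer<Text, Text, Text, Text> {

  private int k;

  @Override
  protected void setup(Context context) {
    k = context.getConfiguration().getInt("knn.k", 5);
  }

  @Override
  protected void reduce(Text recordId, Iterable<Text> candidates, Context context)
      throws IOException, InterruptedException {
    // Gather every per-partition candidate for this test record.
    List<String[]> all = new ArrayList<>();
    for (Text t : candidates) {
      all.add(t.toString().split(","));  // ["distance", "label"]
    }

    // Keep the k globally smallest distances across all partitions.
    all.sort(Comparator.comparingDouble(p -> Double.parseDouble(p[0])));

    // Majority vote among the labels of the k nearest candidates.
    Map<String, Integer> votes = new HashMap<>();
    for (int i = 0; i < Math.min(k, all.size()); i++) {
      votes.merge(all.get(i)[1], 1, Integer::sum);
    }
    String predicted = null;
    int best = -1;
    for (Map.Entry<String, Integer> e : votes.entrySet()) {
      if (e.getValue() > best) {
        best = e.getValue();
        predicted = e.getKey();
      }
    }
    context.write(recordId, new Text(predicted));
  }
}

If each mapper emits only its local top-k per test record, the reducer sees at most k times the number of partitions candidates per record, which keeps the shuffle small even when the model itself is far too large for the Distributed Cache.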