I have 4 nodes and the replication factor is set to 3.
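For reference, the replication change suggested in the reply quoted below (hdfs dfs -setrep) can also be done programmatically. The following is only a rough sketch, assuming the model path /user/model/data from that example and a target replication equal to the number of datanodes (4 here); it also makes a conservative free-space check first, as the reply suggests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseModelReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path model = new Path("/user/model/data");  // path taken from the setrep example below
        short targetReplication = 4;                // one copy per datanode in a 4-node cluster

        // Conservative space check: compare the full raw footprint after the change
        // against the free space, ignoring the replicas that already exist.
        long fileSize = fs.getContentSummary(model).getLength();
        long remaining = fs.getStatus().getRemaining();
        if (remaining < fileSize * targetReplication) {
            System.err.println("Not enough free HDFS space to raise the replication factor");
            return;
        }

        // Equivalent to 'hdfs dfs -setrep 4 /user/model/data'. Without -w the
        // namenode schedules the extra replicas asynchronously.
        boolean changed = fs.setReplication(model, targetReplication);
        System.out.println("Replication change requested: " + changed);
    }
}

The shell command in the quoted reply does the same thing; the -w flag additionally waits until the extra replicas are actually in place.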
On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 <[email protected]> wrote:

> Yes, almost the same. I assume the most time-consuming part was copying the
> model data from the datanode that holds it to the node that actually runs
> the task (tasktracker or nodemanager).
>
> What is the model data's replication factor, and how many nodes do you have?
> If you have 4 or more nodes, you can increase the replication with the
> following command. I suggest a number equal to your number of datanodes, but
> first you should confirm there is enough space in HDFS.
>
>     hdfs dfs -setrep -w 6 /user/model/data
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <[email protected]> wrote:
>
>> Yes, I tried the same, Drake.
>>
>> I don't know if I understood your answer.
>>
>> Instead of loading the model into setup() through the cache, I read it
>> directly from HDFS in the map section, and for each incoming record I find
>> the distance to all the records in HDFS. That is, if R and S are my
>> datasets, R is the model data stored in HDFS, and when S is taken for
>> processing:
>> S1-R (finding the distance to the whole R set)
>> S2-R
>>
>> But it is taking a long time, as it needs to compute all those distances.
>>
>> On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <[email protected]> wrote:
>>
>>> In my suggestion, map or reduce tasks do not use the distributed cache.
>>> They read the file directly from HDFS with short-circuit local reads.
>>> It is like a shared-storage method, but almost every node has the data
>>> because of the high replication factor.
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <[email protected]> wrote:
>>>
>>>> But still, if the model is large enough, how can we load it into the
>>>> distributed cache or something like that?
>>>> Here is one source:
>>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>>> But it is confusing me.
>>>>
>>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How about this? The large model data stays in HDFS, but with many
>>>>> replications, and the MapReduce program reads the model from HDFS. In
>>>>> theory, the replication factor of the model data equals the number of
>>>>> datanodes, and with the Short Circuit Local Reads feature of the HDFS
>>>>> datanode, the map or reduce tasks read the model data from their own
>>>>> disks.
>>>>>
>>>>> This approach may use a lot of HDFS space, but the annoying partition
>>>>> problem will be gone.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Drake 민영근 Ph.D
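As a rough illustration of the approach quoted above (reading the model straight from HDFS inside the task instead of through the distributed cache), here is a minimal mapper sketch. The model format is assumed purely for illustration: a plain-text file of comma-separated feature vectors with the class label in the last column, its path passed in under a made-up job property name, knn.model.path. The sketch loads the model into task memory in setup(); if it does not fit, it would have to be re-streamed inside map(), which is the slow case described above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<double[]> model = new ArrayList<>();  // feature vectors of the model set R
    private final List<String> labels = new ArrayList<>();   // class label of each model record

    @Override
    protected void setup(Context context) throws IOException {
        // Read the model once per task, directly from HDFS (no DistributedCache).
        // With a high replication factor and short-circuit reads enabled, this
        // read is usually served from the local disk of the node running the task.
        Path modelPath = new Path(context.getConfiguration().get("knn.model.path"));
        FileSystem fs = modelPath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                double[] features = new double[parts.length - 1];
                for (int i = 0; i < features.length; i++) {
                    features[i] = Double.parseDouble(parts[i]);
                }
                model.add(features);
                labels.add(parts[parts.length - 1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One test record S_i: compute its distance to every model record in R
        // and emit the label of the nearest one.
        String[] parts = value.toString().split(",");
        double[] features = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            features[i] = Double.parseDouble(parts[i]);
        }

        double best = Double.MAX_VALUE;
        String bestLabel = null;
        for (int i = 0; i < model.size(); i++) {
            double d = squaredDistance(features, model.get(i));
            if (d < best) {
                best = d;
                bestLabel = labels.get(i);
            }
        }
        if (bestLabel != null) {
            context.write(value, new Text(bestLabel));
        }
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}

Whether the read is actually local depends on the replication factor and on short-circuit local reads being enabled on the datanodes (dfs.client.read.shortcircuit and dfs.domain.socket.path in hdfs-site.xml).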
>>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>
>>>>>> Is there any way?
>>>>>> I am waiting for a reply. I have posted this question everywhere, but
>>>>>> no one is responding. I feel like this is the right place to ask
>>>>>> doubts, as some of you may have come across the same issue and got
>>>>>> stuck.
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, one of my friends is implementing the same thing. I know global
>>>>>>> sharing of data is not possible across Hadoop MapReduce, but I need
>>>>>>> to check whether it can be done somehow in Hadoop MapReduce as well,
>>>>>>> because I found some papers on KNN in Hadoop too, and I am trying to
>>>>>>> compare the performance.
>>>>>>>
>>>>>>> Hope some pointers can help me.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>
>>>>>>>> Have you considered implementing this using something like Spark?
>>>>>>>> That could be much easier than raw map-reduce.
>>>>>>>>
>>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> In a KNN-like algorithm we need to load the model data into the
>>>>>>>>> cache for predicting the records.
>>>>>>>>>
>>>>>>>>> Here is the example for KNN.
>>>>>>>>>
>>>>>>>>> [image: Inline image 1]
>>>>>>>>>
>>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>>>> to load it into the distributed cache.
>>>>>>>>>
>>>>>>>>> One way is to split/partition the model into some files, perform
>>>>>>>>> the distance calculation for all records in each file, and then
>>>>>>>>> find the minimum distance and the most frequent class label to
>>>>>>>>> predict the outcome.
>>>>>>>>>
>>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>>> partitions?
>>>>>>>>>
>>>>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>>>>
>>>>>>>>> This is what came to my mind.
>>>>>>>>>
>>>>>>>>> Is there any other way?
>>>>>>>>>
>>>>>>>>> Any pointers would help me.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards
>>>>>>>>> Unmesha Sreeveni U.B
>>>>>>>>> Hadoop, Bigdata Developer
>>>>>>>>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>>>>>>>>> http://www.unmeshasreeveni.blogspot.in/

--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/
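On the partitioning idea in the original question above: one common shape (sketched here under assumed record formats, not something stated in the thread) is to let each map task score a test record against only its own partition of the model and emit the local k nearest candidates keyed by the test record; a reducer then merges those candidates, keeps the global k nearest, and takes a majority vote on the class label.

import java.io.IOException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values per test record look like "<distance>,<classLabel>", one for each
// candidate neighbour found in some partition of the model.
public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

    private static final int K = 5;  // assumed k; in practice this would come from the job configuration

    private static final class Candidate {
        final double distance;
        final String label;
        Candidate(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    @Override
    protected void reduce(Text testRecord, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Max-heap on distance: the head is the worst of the current k best,
        // so it can be evicted when a closer candidate arrives.
        PriorityQueue<Candidate> nearest = new PriorityQueue<>(
                K, Comparator.comparingDouble((Candidate c) -> c.distance).reversed());

        for (Text value : values) {
            String[] parts = value.toString().split(",");
            Candidate c = new Candidate(Double.parseDouble(parts[0]), parts[1]);
            if (nearest.size() < K) {
                nearest.add(c);
            } else if (c.distance < nearest.peek().distance) {
                nearest.poll();
                nearest.add(c);
            }
        }

        // Majority vote over the labels of the global k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (Candidate c : nearest) {
            votes.merge(c.label, 1, Integer::sum);
        }
        String predicted = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                predicted = e.getKey();
            }
        }
        context.write(testRecord, new Text(predicted));
    }
}

The merge step stays cheap because each partition contributes at most k candidates per test record, so the reducer never has to see the full model.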
