Hi Mikael,

So with your approach we won't need to write an MR job for counting the number of records in the file to find the 80% and 20% split?
On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <[email protected]> wrote:

> I would use a different approach. For each row in the mapper I would
> invoke random.nextDouble(); if the number generated is below 0.8, the row
> goes to the key for training, otherwise to the key for the test set.
>
> Mikael.S
>
> ------------------------------
> From: Susheel Kumar Gadalay <[email protected]>
> Sent: 12/12/2014 12:00
> To: [email protected]
> Subject: Re: Split files into 80% and 20% for building model and prediction
>
> Simple solution:
>
> Copy the HDFS file to local and use OS commands to count the number of lines:
>
> cat file1 | wc -l
>
> and cut it based on line number.
>
> On 12/12/14, unmesha sreeveni <[email protected]> wrote:
> > I am trying to divide my HDFS file into two parts/files, 80% and 20%,
> > for a classification algorithm (80% for modelling and 20% for
> > prediction). Please provide suggestions for the same.
> >
> > To take 80% and 20% into two separate files we need to know the exact
> > number of records in the data set, and that is only known if we go
> > through the data set once. So we need to write one MapReduce job just
> > for counting the number of records, and a second MapReduce job for
> > separating the 80% and 20% into two files using MultipleOutputs.
> >
> > Am I on the right track, or is there an alternative for the same?
> > But again, a small confusion: how do I check whether the reducer gets
> > filled with 80% of the data?
> >
> > --
> > Thanks & Regards
> >
> > Unmesha Sreeveni U.B
> > Hadoop, Bigdata Developer
> > Centre for Cyber Security | Amrita Vishwa Vidyapeetham
> > http://www.unmeshasreeveni.blogspot.in/

--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/
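A minimal map-only sketch of the random-split approach Mikael describes, assuming plain text input; the class name RandomSplitMapper, the "train"/"test" base output paths, and the use of MultipleOutputs are illustrative choices, not something settled in the thread:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RandomSplitMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> outputs;
    private final Random random = new Random();

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each record independently lands in the training file with
        // probability 0.8, so no separate counting job is needed.
        if (random.nextDouble() < 0.8) {
            outputs.write(NullWritable.get(), value, "train/part");
        } else {
            outputs.write(NullWritable.get(), value, "test/part");
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        outputs.close();  // flush and close the extra output files
    }
}

Run it as a map-only job (job.setNumReduceTasks(0)). Note the split is only 80/20 in expectation, since each record is assigned independently; an exact 80/20 cut would still require counting the records first.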
