Hi Unmesha, if you use the approach Mikael suggested, taking a random 80% of the data for training and the rest for testing, you should get a good distribution for building your predictive model.
Thanks,
Hitarth

> On Dec 12, 2014, at 6:00 AM, unmesha sreeveni <[email protected]> wrote:
>
> Hi Mikael
> So you won't write an MR job for counting the number of records in that file
> to find 80% and 20%?
>
>> On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <[email protected]> wrote:
>>
>> I would use a different approach. For each row in the mapper I would
>> invoke random.nextDouble(); if the number generated is below 0.8, the row
>> goes to the key for training, otherwise to the key for the test set.
>>
>> Mikael.s
>>
>> From: Susheel Kumar Gadalay
>> Sent: 12/12/2014 12:00
>> To: [email protected]
>> Subject: Re: Split files into 80% and 20% for building model and prediction
>>
>> Simple solution:
>>
>> Copy the HDFS file to local and use OS commands to count the number of lines
>>
>> cat file1 | wc -l
>>
>> and cut it based on line number.
>>
>>> On 12/12/14, unmesha sreeveni <[email protected]> wrote:
>>>
>>> I am trying to divide my HDFS file into two parts/files,
>>> 80% and 20%, for a classification algorithm (80% for modelling and 20%
>>> for prediction). Please provide suggestions for the same.
>>>
>>> To take 80% and 20% into two separate files we need to know the exact
>>> number of records in the data set, and that is only known if we go
>>> through the data set once. So we need to write one MapReduce job just
>>> for counting the number of records, and a second MapReduce job for
>>> separating 80% and 20% into two files using MultipleOutputs.
>>>
>>> Am I on the right track, or is there an alternative for the same?
>>> But again, a small confusion: how to check whether the reducer got
>>> filled with 80% of the data?
>>>
>>> --
>>> *Thanks & Regards*
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
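For the archives: Mikael's per-row random split can be sketched locally with awk (a sketch only, assuming a plain-text input named data.txt and illustrative output names train.txt/test.txt; inside a Hadoop mapper the same `rand() < 0.8` test would decide which output each record is written to). No counting pass is needed, because each line independently lands in exactly one of the two files:

```shell
# Per-line coin flip: ~80% of lines go to train.txt, ~20% to test.txt.
# srand(42) fixes the seed so the split is repeatable for a given awk.
awk 'BEGIN { srand(42) }
     { if (rand() < 0.8) print > "train.txt"
       else              print > "test.txt" }' data.txt
```

Note the split sizes are only approximately 80/20 (each line is an independent Bernoulli draw), which is usually fine for train/test sampling.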
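Susheel's count-then-cut recipe can be written out as below (a sketch; data.txt, train.txt, and test.txt are placeholder names). It gives an exact 80/20 split, but note the first 80% of lines is not a random sample, so if the file is ordered (e.g. by class label) this can bias the model:

```shell
# Count lines, then cut the file at the 80% mark.
n=$(wc -l < data.txt)             # total number of records
cut=$(( n * 80 / 100 ))           # line number where the 80% portion ends
head -n "$cut" data.txt > train.txt
tail -n +"$((cut + 1))" data.txt > test.txt
```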
