Hi Unmesha, if you use the approach Mikael suggested, taking a random 80% of the data for training and the rest for testing, you should get a good distribution for building your predictive model.
Thanks,
Hitarth

> On Dec 12, 2014, at 6:00 AM, unmesha sreeveni <[email protected]> wrote:
>
> Hi Mikael
> So you won't write an MR job for counting the number of records in that file
> to find 80% and 20%?
>
>> On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <[email protected]> wrote:
>>
>> I would use a different approach. For each row in the mapper I would
>> invoke random.nextDouble(); if the number generated is below 0.8, the row
>> goes to the key for training, otherwise to the key for the test set.
>>
>> Mikael.s
>>
>> From: Susheel Kumar Gadalay
>> Sent: 12/12/2014 12:00
>> To: [email protected]
>> Subject: Re: Split files into 80% and 20% for building model and prediction
>>
>> Simple solution:
>>
>> Copy the HDFS file to local and use OS commands to count the number of lines
>>
>> cat file1 | wc -l
>>
>> and cut it based on line number.
>>
>>> On 12/12/14, unmesha sreeveni <[email protected]> wrote:
>>>
>>> I am trying to divide my HDFS file into two parts/files,
>>> 80% and 20%, for a classification algorithm (80% for modelling and 20%
>>> for prediction). Please provide suggestions for the same.
>>>
>>> To take 80% and 20% into two separate files we need to know the exact
>>> number of records in the data set, and that is only known if we go
>>> through the data set once. So we need to write one MapReduce job just
>>> for counting the number of records, and a second MapReduce job for
>>> separating 80% and 20% into two files using MultipleOutputs.
>>>
>>> Am I on the right track, or is there an alternative for the same?
>>> But again, a small confusion: how to check whether the reducer got
>>> filled with 80% of the data?
>>>
>>> --
>>> *Thanks & Regards*
>>>
>>> *Unmesha Sreeveni U.B*
>>> *Hadoop, Bigdata Developer*
>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>> http://www.unmeshasreeveni.blogspot.in/
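For the archives: Mikael's per-row random split can be sketched locally with awk (a sketch only, assuming a plain-text input named data.txt and illustrative output names train.txt/test.txt; inside a Hadoop mapper the same `rand() < 0.8` test would decide which output each record is written to). No counting pass is needed, because each line independently lands in exactly one of the two files:

```shell
# Per-line coin flip: ~80% of lines go to train.txt, ~20% to test.txt.
# srand(42) fixes the seed so the split is repeatable for a given awk.
awk 'BEGIN { srand(42) }
     { if (rand() < 0.8) print > "train.txt"
       else              print > "test.txt" }' data.txt
```

Note the split sizes are only approximately 80/20 (each line is an independent Bernoulli draw), which is usually fine for train/test sampling.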
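Susheel's count-then-cut recipe can be written out as below (a sketch; data.txt, train.txt, and test.txt are placeholder names). It gives an exact 80/20 split, but note the first 80% of lines is not a random sample, so if the file is ordered (e.g. by class label) this can bias the model:

```shell
# Count lines, then cut the file at the 80% mark.
n=$(wc -l < data.txt)             # total number of records
cut=$(( n * 80 / 100 ))           # line number where the 80% portion ends
head -n "$cut" data.txt > train.txt
tail -n +"$((cut + 1))" data.txt > test.txt
```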
