Hi Unmesha,

With the random approach you don't need to write an MR job for counting.

Mikael.s

-----Original Message-----
From: "Hitarth" <[email protected]>
Sent: 12/12/2014 15:20
To: "[email protected]" <[email protected]>
Subject: Re: Split files into 80% and 20% for building model and prediction

Hi Unmesha,

If you use the approach suggested by Mikael of taking a random 80% of the data for training and the rest for testing, you get a good distribution for generating your predictive model.

Thanks,
Hitarth

On Dec 12, 2014, at 6:00 AM, unmesha sreeveni <[email protected]> wrote:

Hi Mikael,

So you won't write an MR job for counting the number of records in the file to find the 80% and 20%?

On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk <[email protected]> wrote:

I would use a different approach. For each row in the mapper I would invoke random.nextDouble(); if the number generated is below 0.8, the row goes to the key for training, otherwise to the key for testing.

Mikael.s

From: Susheel Kumar Gadalay
Sent: 12/12/2014 12:00
To: [email protected]
Subject: Re: Split files into 80% and 20% for building model and prediction

Simple solution: copy the HDFS file to the local filesystem, count the number of lines with an OS command

cat file1 | wc -l

and cut it based on line number.

On 12/12/14, unmesha sreeveni <[email protected]> wrote:
> I am trying to divide my HDFS file into two parts/files,
> 80% and 20%, for a classification algorithm (80% for modelling and 20% for
> prediction). Please provide suggestions for the same.
>
> To write the 80% and 20% to two separate files we need to know the exact
> number of records in the data set, and that is only known if we go through
> the data set once. So we need to write one MapReduce job just for counting
> the number of records, and a second MapReduce job for separating the 80%
> and 20% into two files using MultipleOutputs.
>
> Am I on the right track, or is there an alternative for the same?
> But again, a small confusion: how do I check whether the reducer gets
> filled with 80% of the data?
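[Editor's note] Mikael's random-split idea can be sketched as a small mapper-style routine. This is a minimal local sketch, not actual Hadoop code: the function names are hypothetical, and the 0.8 threshold follows the thread. In a real job the tag would be emitted as the map output key (or routed via MultipleOutputs), and each record would be split independently with no counting pass.

```python
import random

def split_record(line, rng, train_frac=0.8):
    """Route one record to 'train' or 'test' based on a single random draw,
    mirroring what the mapper would do per row."""
    tag = "train" if rng.random() < train_frac else "test"
    return tag, line

def run_mapper(lines, seed=None, train_frac=0.8):
    """Simulate the mapper over an iterable of records.

    Returns (train_records, test_records). Seeding the generator makes the
    split reproducible across runs, which real jobs achieve per-task.
    """
    rng = random.Random(seed)
    train, test = [], []
    for line in lines:
        tag, rec = split_record(line, rng, train_frac)
        (train if tag == "train" else test).append(rec)
    return train, test
```

Note the split is only approximately 80/20: each record is assigned independently, so the exact counts vary slightly from run to run, which is usually acceptable for a train/test split.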
> --
> Thanks & Regards
>
> Unmesha Sreeveni U.B
> Hadoop, Bigdata Developer
> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
> http://www.unmeshasreeveni.blogspot.in/
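[Editor's note] For comparison, the exact two-pass approach Unmesha describes (count first, then cut at a line number) can be sketched locally. This is a plain-Python stand-in for the counting job plus the splitting job; the helper names are hypothetical.

```python
def count_records(lines):
    """First pass: count the records (stand-in for the counting MR job)."""
    return sum(1 for _ in lines)

def split_by_count(lines, train_frac=0.8):
    """Second pass: send the first train_frac of records to the training
    set and the remainder to the test set, using the exact count."""
    n = count_records(lines)
    cut = int(n * train_frac)
    return lines[:cut], lines[cut:]
```

Unlike the random approach, this gives an exact 80/20 split but costs an extra pass over the data, and taking a contiguous prefix assumes the records are not ordered in a way that would bias the training set (shuffling first avoids that).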
