I have 4 nodes and the replication factor is set to 3.
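For reference, the replication change suggested in the reply quoted below (hdfs dfs -setrep) can also be done programmatically. The following is only a rough sketch, assuming the model path /user/model/data from that example and a target replication equal to the number of datanodes (4 here); it also makes a conservative free-space check first, as the reply suggests.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseModelReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path model = new Path("/user/model/data");  // path taken from the setrep example below
        short targetReplication = 4;                // one copy per datanode in a 4-node cluster

        // Conservative space check: compare the full raw footprint after the change
        // against the free space, ignoring the replicas that already exist.
        long fileSize = fs.getContentSummary(model).getLength();
        long remaining = fs.getStatus().getRemaining();
        if (remaining < fileSize * targetReplication) {
            System.err.println("Not enough free HDFS space to raise the replication factor");
            return;
        }

        // Equivalent to 'hdfs dfs -setrep 4 /user/model/data'. Without -w the
        // namenode schedules the extra replicas asynchronously.
        boolean changed = fs.setReplication(model, targetReplication);
        System.out.println("Replication change requested: " + changed);
    }
}

The shell command in the quoted reply does the same thing; the -w flag additionally waits until the extra replicas are actually in place.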
On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 <[email protected]> wrote:

> Yes, almost the same. I assume the most time-consuming part was copying the
> model data from the datanode that holds it to the node that actually runs
> the task (tasktracker or nodemanager).
>
> What is the model data's replication factor, and how many nodes do you have?
> If you have 4 or more nodes, you can increase the replication with the
> following command. I suggest a number equal to your number of datanodes, but
> first you should confirm there is enough space in HDFS.
>
>     hdfs dfs -setrep -w 6 /user/model/data
>
> Drake 민영근 Ph.D
>
> On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni <[email protected]> wrote:
>
>> Yes, I tried the same, Drake.
>>
>> I don't know if I understood your answer.
>>
>> Instead of loading the model into setup() through the cache, I read it
>> directly from HDFS in the map section, and for each incoming record I find
>> the distance to all the records in HDFS. That is, if R and S are my
>> datasets, R is the model data stored in HDFS, and when S is taken for
>> processing:
>> S1-R (finding the distance to the whole R set)
>> S2-R
>>
>> But it is taking a long time, as it needs to compute all those distances.
>>
>> On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 <[email protected]> wrote:
>>
>>> In my suggestion, map or reduce tasks do not use the distributed cache.
>>> They read the file directly from HDFS with short-circuit local reads.
>>> It is like a shared-storage method, but almost every node has the data
>>> because of the high replication factor.
>>>
>>> Drake 민영근 Ph.D
>>>
>>> On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <[email protected]> wrote:
>>>
>>>> But still, if the model is large enough, how can we load it into the
>>>> distributed cache or something like that?
>>>> Here is one source:
>>>> http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
>>>> But it is confusing me.
>>>>
>>>> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How about this? The large model data stays in HDFS, but with many
>>>>> replications, and the MapReduce program reads the model from HDFS. In
>>>>> theory, the replication factor of the model data equals the number of
>>>>> datanodes, and with the Short Circuit Local Reads feature of the HDFS
>>>>> datanode, the map or reduce tasks read the model data from their own
>>>>> disks.
>>>>>
>>>>> This approach may use a lot of HDFS space, but the annoying partition
>>>>> problem will be gone.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Drake 민영근 Ph.D
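As a rough illustration of the approach quoted above (reading the model straight from HDFS inside the task instead of through the distributed cache), here is a minimal mapper sketch. The model format is assumed purely for illustration: a plain-text file of comma-separated feature vectors with the class label in the last column, its path passed in under a made-up job property name, knn.model.path. The sketch loads the model into task memory in setup(); if it does not fit, it would have to be re-streamed inside map(), which is the slow case described above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<double[]> model = new ArrayList<>();  // feature vectors of the model set R
    private final List<String> labels = new ArrayList<>();   // class label of each model record

    @Override
    protected void setup(Context context) throws IOException {
        // Read the model once per task, directly from HDFS (no DistributedCache).
        // With a high replication factor and short-circuit reads enabled, this
        // read is usually served from the local disk of the node running the task.
        Path modelPath = new Path(context.getConfiguration().get("knn.model.path"));
        FileSystem fs = modelPath.getFileSystem(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(modelPath)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                double[] features = new double[parts.length - 1];
                for (int i = 0; i < features.length; i++) {
                    features[i] = Double.parseDouble(parts[i]);
                }
                model.add(features);
                labels.add(parts[parts.length - 1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One test record S_i: compute its distance to every model record in R
        // and emit the label of the nearest one.
        String[] parts = value.toString().split(",");
        double[] features = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            features[i] = Double.parseDouble(parts[i]);
        }

        double best = Double.MAX_VALUE;
        String bestLabel = null;
        for (int i = 0; i < model.size(); i++) {
            double d = squaredDistance(features, model.get(i));
            if (d < best) {
                best = d;
                bestLabel = labels.get(i);
            }
        }
        if (bestLabel != null) {
            context.write(value, new Text(bestLabel));
        }
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}

Whether the read is actually local depends on the replication factor and on short-circuit local reads being enabled on the datanodes (dfs.client.read.shortcircuit and dfs.domain.socket.path in hdfs-site.xml).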
>>>>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>
>>>>>> Is there any way?
>>>>>> I am waiting for a reply. I have posted this question everywhere, but
>>>>>> no one is responding. I feel like this is the right place to ask
>>>>>> doubts, as some of you may have come across the same issue and got
>>>>>> stuck.
>>>>>>
>>>>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, one of my friends is implementing the same thing. I know global
>>>>>>> sharing of data is not possible across Hadoop MapReduce, but I need
>>>>>>> to check whether it can be done somehow in Hadoop MapReduce as well,
>>>>>>> because I found some papers on KNN in Hadoop too, and I am trying to
>>>>>>> compare the performance.
>>>>>>>
>>>>>>> Hope some pointers can help me.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <[email protected]> wrote:
>>>>>>>
>>>>>>>> Have you considered implementing this using something like Spark?
>>>>>>>> That could be much easier than raw map-reduce.
>>>>>>>>
>>>>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> In a KNN-like algorithm we need to load the model data into the
>>>>>>>>> cache for predicting the records.
>>>>>>>>>
>>>>>>>>> Here is the example for KNN.
>>>>>>>>>
>>>>>>>>> [image: Inline image 1]
>>>>>>>>>
>>>>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>>>>> to load it into the distributed cache.
>>>>>>>>>
>>>>>>>>> One way is to split/partition the model into some files, perform
>>>>>>>>> the distance calculation for all records in each file, and then
>>>>>>>>> find the minimum distance and the most frequent class label to
>>>>>>>>> predict the outcome.
>>>>>>>>>
>>>>>>>>> How can we partition the file and perform the operation on these
>>>>>>>>> partitions?
>>>>>>>>>
>>>>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>>>>
>>>>>>>>> This is what came to my mind.
>>>>>>>>>
>>>>>>>>> Is there any other way?
>>>>>>>>>
>>>>>>>>> Any pointers would help me.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards
>>>>>>>>> Unmesha Sreeveni U.B
>>>>>>>>> Hadoop, Bigdata Developer
>>>>>>>>> Centre for Cyber Security | Amrita Vishwa Vidyapeetham
>>>>>>>>> http://www.unmeshasreeveni.blogspot.in/

--
Thanks & Regards

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/
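On the partitioning idea in the original question above: one common shape (sketched here under assumed record formats, not something stated in the thread) is to let each map task score a test record against only its own partition of the model and emit the local k nearest candidates keyed by the test record; a reducer then merges those candidates, keeps the global k nearest, and takes a majority vote on the class label.

import java.io.IOException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values per test record look like "<distance>,<classLabel>", one for each
// candidate neighbour found in some partition of the model.
public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

    private static final int K = 5;  // assumed k; in practice this would come from the job configuration

    private static final class Candidate {
        final double distance;
        final String label;
        Candidate(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    @Override
    protected void reduce(Text testRecord, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Max-heap on distance: the head is the worst of the current k best,
        // so it can be evicted when a closer candidate arrives.
        PriorityQueue<Candidate> nearest = new PriorityQueue<>(
                K, Comparator.comparingDouble((Candidate c) -> c.distance).reversed());

        for (Text value : values) {
            String[] parts = value.toString().split(",");
            Candidate c = new Candidate(Double.parseDouble(parts[0]), parts[1]);
            if (nearest.size() < K) {
                nearest.add(c);
            } else if (c.distance < nearest.peek().distance) {
                nearest.poll();
                nearest.add(c);
            }
        }

        // Majority vote over the labels of the global k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (Candidate c : nearest) {
            votes.merge(c.label, 1, Integer::sum);
        }
        String predicted = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                predicted = e.getKey();
            }
        }
        context.write(testRecord, new Text(predicted));
    }
}

The merge step stays cheap because each partition contributes at most k candidates per test record, so the reducer never has to see the full model.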
