In my suggestion, the map or reduce tasks do not use the distributed cache. They read the model file directly from HDFS using short-circuit local reads. It works like a shared-storage approach, but with a high replication factor almost every node holds the data on its own disk.
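As a rough sketch of this suggestion: short-circuit local reads are enabled in hdfs-site.xml (the socket path below is only an example; it must exist and be writable by the datanode), and the model file's replication factor is raised separately, e.g. with `hdfs dfs -setrep`.

```xml
<!-- hdfs-site.xml: enable short-circuit local reads
     (needed on both the client and the DataNode side) -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

With that in place, something like `hdfs dfs -setrep -w 10 /path/to/model` (path and factor are illustrative) pushes a copy of the model's blocks onto most datanodes, so tasks read it locally.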
Drake 민영근 Ph.D

On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni <[email protected]> wrote:

> But still, if the model is very large, how can we load it into the
> Distributed Cache or something like that?
> Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
> But it is confusing me.
>
> On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 <[email protected]> wrote:
>
>> Hi,
>>
>> How about this? The large model data stays in HDFS, but with many
>> replications, and the MapReduce program reads the model from HDFS. In
>> theory, the replication factor of the model data equals the number of
>> data nodes, and with the Short-Circuit Local Reads feature of the HDFS
>> datanode, the map or reduce tasks read the model data from their own
>> disks.
>>
>> This way may use a lot of HDFS capacity, but the annoying partition
>> problem will be gone.
>>
>> Thanks
>>
>> Drake 민영근 Ph.D
>>
>> On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni <[email protected]>
>> wrote:
>>
>>> Is there any way?
>>> Waiting for a reply. I have posted the question everywhere, but no one
>>> is responding.
>>> I feel like this is the right place to ask doubts, as some of you may
>>> have come across the same issue and got stuck.
>>>
>>> On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni <
>>> [email protected]> wrote:
>>>
>>>> Yes, one of my friends is implementing the same. I know global sharing
>>>> of data is not possible across Hadoop MapReduce, but I need to check
>>>> whether it can somehow be done in Hadoop MapReduce as well, because I
>>>> found some papers on KNN on Hadoop too.
>>>> And I am trying to compare the performance too.
>>>>
>>>> Hope some pointers can help me.
>>>>
>>>> On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning <[email protected]>
>>>> wrote:
>>>>
>>>>> Have you considered implementing it using something like Spark? That
>>>>> could be much easier than raw map-reduce.
>>>>>
>>>>> On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> In a KNN-like algorithm we need to load the model data into a cache
>>>>>> for predicting the records.
>>>>>>
>>>>>> Here is the example for KNN.
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>> So if the model is a large file, say 1 or 2 GB, we will not be able
>>>>>> to load it into the Distributed Cache.
>>>>>>
>>>>>> One way is to split/partition the model result into some files,
>>>>>> perform the distance calculation for all records in each file, and
>>>>>> then find the minimum distance and the most frequent class label to
>>>>>> predict the outcome.
>>>>>>
>>>>>> How can we partition the file and perform the operation on these
>>>>>> partitions?
>>>>>>
>>>>>> i.e. 1st record <Distance> partition1, partition2, ...
>>>>>>      2nd record <Distance> partition1, partition2, ...
>>>>>>
>>>>>> This is what came to my thought.
>>>>>>
>>>>>> Is there any further way?
>>>>>>
>>>>>> Any pointers would help me.
>>>>>>
>>>>>> --
>>>>>> *Thanks & Regards *
>>>>>>
>>>>>> *Unmesha Sreeveni U.B*
>>>>>> *Hadoop, Bigdata Developer*
>>>>>> *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
>>>>>> http://www.unmeshasreeveni.blogspot.in/
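The partition scheme discussed in the quoted thread (compute distances against each model partition separately, then combine the per-partition candidates and majority-vote on the class label) can be sketched in plain Python, outside Hadoop. This is not Hadoop API code; `partial_knn` stands in for one map task's work on one partition and `merge_and_predict` for the reduce-side merge, and all names are made up for illustration.

```python
from collections import Counter
import heapq
import math

def partial_knn(test_point, model_partition, k):
    """Per-partition step (one 'map task'): return the k nearest
    (distance, label) pairs from this partition of the model."""
    dists = [(math.dist(test_point, feats), label)
             for feats, label in model_partition]
    return heapq.nsmallest(k, dists)

def merge_and_predict(partial_results, k):
    """Merge step (the 'reducer'): keep the globally k nearest
    candidates across all partitions, then majority-vote on labels."""
    nearest = heapq.nsmallest(
        k, (pair for partial in partial_results for pair in partial))
    votes = Counter(label for _dist, label in nearest)
    return votes.most_common(1)[0][0]

# Toy model split into two partitions: (features, class label) pairs
part1 = [((0.0, 0.0), "A"), ((1.0, 1.0), "A")]
part2 = [((5.0, 5.0), "B"), ((0.5, 0.0), "A"), ((4.0, 4.0), "B")]
query = (0.2, 0.1)

partials = [partial_knn(query, p, k=3) for p in (part1, part2)]
print(merge_and_predict(partials, k=3))  # prints "A"
```

The key point is that each partition only needs to emit its local top-k, which bounds the intermediate data per record regardless of how large the full model is.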
