Generally speaking, fancy algorithms designed for a single machine often cannot be done in a MapReduce manner; think of graph operations. So go back to the original goal: what you want is to check whether something occurs in something else. To do this in Pig, I guess one could do a left outer join; in the result, any tuple where the other participant in the join comes back null is a mismatch. Would this work? I don't believe anyone would try a binary search on a bag unless it is small. Generally speaking, either a map-side or a reduce-side search will do the job for you.
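To make the join idea concrete, here is a minimal plain-Java sketch of the logic (it is not Pig code and does not use the Pig API). It models a left outer join against an in-memory lookup table and then keeps the rows whose right side came back null. The class and method names are made up for illustration, and it assumes the right side has unique keys and no null values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this only illustrates the null-check idea.
class LeftOuterJoinSketch {

    /** One joined row: the left key plus the right-side value, or null when nothing matched. */
    static final class Joined {
        final String leftKey;
        final String rightValue; // null plays the role of the null a left outer join produces

        Joined(String leftKey, String rightValue) {
            this.leftKey = leftKey;
            this.rightValue = rightValue;
        }
    }

    /** Left outer join of the left keys against a right-side lookup table. */
    static List<Joined> leftOuterJoin(List<String> left, Map<String, String> right) {
        List<Joined> out = new ArrayList<>();
        for (String key : left) {
            // Map.get returns null for a missing key (assumes no null values are stored).
            out.add(new Joined(key, right.get(key)));
        }
        return out;
    }

    /** Left keys whose right side came back null, i.e. the mismatches. */
    static List<String> mismatches(List<Joined> joined) {
        List<String> out = new ArrayList<>();
        for (Joined row : joined) {
            if (row.rightValue == null) {
                out.add(row.leftKey);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> right = new HashMap<>();
        right.put("a", "1");
        right.put("c", "3");
        List<String> left = Arrays.asList("a", "b", "c", "d");
        System.out.println(mismatches(leftOuterJoin(left, right))); // prints [b, d]
    }
}
```

Keeping the lookup table in memory corresponds roughly to the map-side case, where one side is small enough to hold in each task; for two large relations the framework would do the matching in the reduce phase instead.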
Best regards,
Michael

________________________________
From: Andrew Wells <[email protected]>
To: [email protected]
Sent: Tuesday, December 13, 2011 2:32 PM
Subject: Re: Implement Binary Search in PIG

Oh, I might as well make a suggestion for random access. Try looking into HBase

On Tue, Dec 13, 2011 at 5:30 PM, Andrew Wells <[email protected]> wrote:

> I don't think this could be done,
>
> pig is just a hadoop job, and the idea behind hadoop is to read all the
> data in a file.
>
> so by the time you put all the data into an array, you would have been
> better off just checking each element for the one you were looking for.
>
> So what you would get is [n + lg (n)], which will just be [n] after
> putting that into an array.
> Second, hadoop is all about large data analysis, usually more than 100GB,
> so putting this into memory is out of the question.
> Third, hadoop is efficient because it processes this large amount of data
> by splitting it up into multiple processes. To do an efficient binary
> search, you would need do this in one mapper or one reducer.
>
> My opinion is just don't fight hadoop/pig.
>
>
>
> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:
>
>> Bags can be very large might not fit into memory, and in such cases some
>> or all of the bag might have to be stored on disk. In such cases, it is not
>> efficient to do random access on the bag. That is why the DataBag interface
>> does not support it.
>>
>> As Prashant suggested, storing it in a tuple would be a good alternative,
>> if you want to have random access to do binary search.
>>
>> -Thejas
>>
>>
>>
>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>
>>> Hi all,
>>> How can I implement a binary search in pig?
>>>
>>> In one relation, there exists a bag whose items are sorted.
>>> And I want to check there exists a specific item in the bag.
>>>
>>> In UDF, I can't random access items in DataBag container.
>>> So I have to transfer the items in DataBag to an ArrayList, and this is
>>> time consuming.
>>>
>>> How can I implement the binary search efficiently in pig?
>>>
>>>
>>
>
