Oh, I might as well make a suggestion for random access: try looking into HBase.
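If you do go the tuple route Thejas mentions in the quoted thread below, the search itself is simple once you have random access. Here is a minimal sketch in plain Java, not a real Pig UDF: the class name SortedFieldSearch and the method contains are made up for illustration, and a java.util.List stands in for the tuple. As far as I know Pig's Tuple exposes get(int) and size(), so the same loop should carry over, but treat that as an assumption and check the API for your version.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Pig.
public class SortedFieldSearch {

    // Binary search over a sorted, random-access list of fields.
    // Assumes the list is sorted ascending by natural ordering.
    public static <T extends Comparable<T>> boolean contains(List<T> sortedFields, T target) {
        int lo = 0;
        int hi = sortedFields.size() - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids int overflow on large sizes
            int cmp = sortedFields.get(mid).compareTo(target);
            if (cmp == 0) {
                return true;
            }
            if (cmp < 0) {
                lo = mid + 1; // target can only be in the upper half
            } else {
                hi = mid - 1; // target can only be in the lower half
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> fields = Arrays.asList(2, 3, 5, 7, 11, 13);
        System.out.println(contains(fields, 7));
        System.out.println(contains(fields, 8));
    }
}
```

This only pays off if the items are already sorted and already sit in something with cheap get(i); copying a bag into a list first costs O(n), which is the point made in the thread. For an in-memory list, java.util.Collections.binarySearch does the same job.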

On Tue, Dec 13, 2011 at 5:30 PM, Andrew Wells <[email protected]> wrote:

> I don't think this could be done.
>
> Pig is just a Hadoop job, and the idea behind Hadoop is to read all the
> data in a file.
>
> So by the time you put all the data into an array, you would have been
> better off just checking each element for the one you were looking for.
>
> So what you would get is [n + lg (n)], which is just [n] once you count
> putting the data into an array.
> Second, Hadoop is all about large data analysis, usually more than 100GB,
> so putting this into memory is out of the question.
> Third, Hadoop is efficient because it processes this large amount of data
> by splitting it up into multiple processes. To do an efficient binary
> search, you would need to do this in one mapper or one reducer.
>
> My opinion is just don't fight Hadoop/Pig.
>
> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:
>
>> Bags can be very large and might not fit into memory, and in such cases
>> some or all of the bag might have to be stored on disk. In such cases, it
>> is not efficient to do random access on the bag. That is why the DataBag
>> interface does not support it.
>>
>> As Prashant suggested, storing it in a tuple would be a good alternative,
>> if you want to have random access to do binary search.
>>
>> -Thejas
>>
>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>
>>> Hi all,
>>> How can I implement a binary search in Pig?
>>>
>>> In one relation, there exists a bag whose items are sorted.
>>> And I want to check whether a specific item exists in the bag.
>>>
>>> In a UDF, I can't randomly access items in the DataBag container.
>>> So I have to transfer the items in the DataBag to an ArrayList, and
>>> this is time consuming.
>>>
>>> How can I implement the binary search efficiently in Pig?
