Oh, I might as well make a suggestion for random access.

Try looking into HBase

On Tue, Dec 13, 2011 at 5:30 PM, Andrew Wells <[email protected]> wrote:

> I don't think this could be done,
>
> Pig is just a Hadoop job, and the idea behind Hadoop is to read all the
> data in a file.
>
> So by the time you have put all the data into an array, you would have been
> better off just checking each element for the one you were looking for.
>
> So what you would get is [n + lg(n)]: n to copy the items into an array,
> plus lg(n) for the search itself. That is still just [n] overall.
> Second, Hadoop is all about large data analysis, usually more than 100GB,
> so putting this into memory is out of the question.
> Third, Hadoop is efficient because it processes this large amount of data
> by splitting it up into multiple processes. To do an efficient binary
> search, you would need to do this in one mapper or one reducer.
>
> My opinion: just don't fight Hadoop/Pig.
>
>
>
> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:
>
>> Bags can be very large and might not fit into memory, and in such cases some
>> or all of the bag might have to be stored on disk. In such cases, it is not
>> efficient to do random access on the bag. That is why the DataBag interface
>> does not support it.
>>
>> As Prashant suggested, storing it in a tuple would be a good alternative,
>> if you want to have random access to do binary search.
>>
>> -Thejas
>>
>>
>>
>> On 12/12/11 7:54 PM, 唐亮 wrote:
>>
>>> Hi all,
>>> How can I implement a binary search in Pig?
>>>
>>> In one relation, there is a bag whose items are sorted,
>>> and I want to check whether a specific item exists in the bag.
>>>
>>> In a UDF, I can't randomly access items in a DataBag container.
>>> So I have to copy the items from the DataBag into an ArrayList, and this is
>>> time-consuming.
>>>
>>> How can I implement the binary search efficiently in Pig?
>>>
>>>
>>
>
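
For what it's worth, here is a minimal sketch of the binary search that
Thejas's tuple suggestion makes possible, assuming the sorted items are
already available through a random-access container. The class and method
names are hypothetical, and a plain java.util.List stands in for a Pig
Tuple so the example runs without Pig on the classpath. In a real UDF the
fields would presumably be read through the Tuple's indexed accessors
(get(i) and size()) instead.

```java
import java.util.Arrays;
import java.util.List;

public class SortedFieldSearch {

    // Hypothetical helper: binary search over a sorted, random-access list.
    // Assumes sortedFields is in ascending order; takes O(lg n) comparisons.
    static <T extends Comparable<T>> boolean contains(List<T> sortedFields, T target) {
        int lo = 0;
        int hi = sortedFields.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
            int cmp = sortedFields.get(mid).compareTo(target);
            if (cmp == 0) {
                return true;           // found the item
            } else if (cmp < 0) {
                lo = mid + 1;          // target lies in the upper half
            } else {
                hi = mid - 1;          // target lies in the lower half
            }
        }
        return false;                  // search range is empty: not present
    }

    public static void main(String[] args) {
        // Stand-in for the sorted fields of a tuple.
        List<Integer> fields = Arrays.asList(2, 3, 5, 7, 11, 13);
        System.out.println(contains(fields, 7));
        System.out.println(contains(fields, 8));
    }
}
```

This only pays off if the items arrive in a random-access container in the
first place. As Andrew points out, copying a bag into an array first costs
n on its own, so the lg(n) search on top of that saves nothing.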
