Re: Implement Binary Search in PIG

唐亮 Wed, 14 Dec 2011 02:13:33 -0800

I'm not using HBase at the moment,
so maybe I can't use DistributedCache.

Also, if I FLATTEN the DataBag, the result is a set of separate Tuples,
so the UDF processes only one Tuple per call and can't implement a
binary search.


So please help me and show me a detailed solution.
Thanks!
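
For reference, the lookup discussed in this thread can be sketched in plain
Java, independent of Pig. This is only a minimal sketch under stated
assumptions: the class name IpSegmentLookup, the UNKNOWN constant and the
find() method are hypothetical, the segments are assumed to be already
sorted by ipStart (as the ORDER in the script does), and they are assumed
not to overlap. How the arrays get filled (by copying the DataBag once, or
by loading a file shipped through the distributed cache) is left open.

```java
/** Hypothetical helper, not part of Pig: finds the segment containing an IP. */
final class IpSegmentLookup {
    // Assumed placeholder returned when no segment contains the IP.
    static final String UNKNOWN = "UnknownIP";

    private final long[] starts;   // ipStart of each segment, ascending
    private final long[] ends;     // ipEnd of each segment (inclusive)
    private final String[] names;  // locName of each segment

    /** Assumes parallel arrays sorted by start, with non-overlapping ranges. */
    IpSegmentLookup(long[] starts, long[] ends, String[] names) {
        if (starts.length != ends.length || starts.length != names.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.starts = starts;
        this.ends = ends;
        this.names = names;
    }

    /** Binary search: O(log n) per lookup once the arrays are in memory. */
    String find(long ip) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;      // overflow-safe midpoint
            if (ip < starts[mid]) {
                hi = mid - 1;               // IP lies before this segment
            } else if (ip > ends[mid]) {
                lo = mid + 1;               // IP lies after this segment
            } else {
                return names[mid];          // starts[mid] <= ip <= ends[mid]
            }
        }
        return UNKNOWN;                     // IP falls in a gap or out of range
    }
}
```

The one-time cost of filling the arrays is O(n), so this only pays off when
the same set of segments is reused for many lookups, which is the case
Dmitriy describes below.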

On 14 December 2011 at 5:59 PM, 唐亮 <[email protected]> wrote:

> Hi Prashant Kommireddi,
>
> If I do steps 1 and 2 as you described,
> the schema will be {tag, ipStart, ipEnd, locName}.
>
> But how should I write the UDF? In particular, what type should the
> input parameter have?
>
> Currently the UDF code is as below; it takes a DataBag as input:
>
> public class GetProvinceNameFromIPNum extends EvalFunc<String> {
>
>     public String exec(Tuple input) throws IOException {
>         if (input == null || input.size() == 0)
>             return UnknownIP;
>         if (input.size() != 2) {
>             throw new IOException("Expected input's size is 2, but is: "
>                     + input.size());
>         }
>
>         // This should be the IP you want to look up
>         Object o1 = input.get(0);
>         if (!(o1 instanceof Long)) {
>             throw new IOException("Expected input 1 to be Long, but got "
>                     + o1.getClass().getName());
>         }
>         // This is the Bag of IP segs
>         Object o2 = input.get(1);
>         // Should I change it to "(o2 instanceof Tuple)"?
>         if (!(o2 instanceof DataBag)) {
>             throw new IOException("Expected input 2 to be DataBag, but got "
>                     + o2.getClass().getName());
>         }
>
>         ........... other codes ...........
>     }
>
> }
>
>
>
> On 14 December 2011 at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> It seems that at the end of this you have a single bag with all the
>> elements, and you would like to check whether an element exists in it
>> based on ipstart/ipend.
>>
>>
>>   1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten -
>>   this will convert the Bag to Tuple:  to_tuple = FOREACH order_ip_segs
>>   GENERATE tag, FLATTEN(order_seg); ---- This is O(n)
>>   2. Now write a UDF that can access the elements positionally for the
>>   BinarySearch
>>   3. Dmitriy and Jonathan's ideas with DistributedCache could perform
>>   better than the above approach, so you could go down that route too.
>>
>>
>> 2011/12/13 唐亮 <[email protected]>
>>
>> > The detailed Pig code is below:
>> >
>> > raw_ip_segment = load ...
>> > ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
>> > group_ip_segs = group ip_segs all;
>> >
>> > order_ip_segs = foreach group_ip_segs {
>> >  order_seg = order ip_segs by ipstart, ipend;
>> >  generate 't' as tag, order_seg;
>> > }
>> > describe order_ip_segs
>> > order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}
>> >
>> > Here, order_ip_segs::order_seg is a BAG.
>> > How can I convert it to a TUPLE?
>> >
>> > And can I access the TUPLE randomly in a UDF?
>> >
>> > On 14 December 2011 at 2:41 PM, 唐亮 <[email protected]> wrote:
>> >
>> > > Then how can I convert all the items in the Bag to a Tuple?
>> > >
>> > >
>> > > 2011/12/14 Jonathan Coveney <[email protected]>
>> > >
>> > >> It's funny, but if you look wayyyy in the past, I actually asked a
>> > >> bunch of questions that circled around, literally, this exact problem.
>> > >>
>> > >> Dmitriy and Prashant are correct: the best way is to make a UDF that
>> > >> can do the lookup really efficiently. This is what the maxmind API
>> > >> does, for example.
>> > >>
>> > >> 2011/12/13 Prashant Kommireddi <[email protected]>
>> > >>
>> > >> > I am lost when you say "If enumerate every IP, it will be more than
>> > >> > 100000000 single IPs"
>> > >> >
>> > >> > If each bag is a collection of 30000 tuples, it might not be too bad
>> > >> > on memory if you used a Tuple to store the segments instead?
>> > >> >
>> > >> > (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes
>> > >> > Let's say we incur an additional overhead of 4x this, which is
>> > >> > ~160 bytes per tuple.
>> > >> > Total per Bag = 30000 x 160 bytes = ~5 MB
>> > >> >
>> > >> > You could probably store the IP segments as a Tuple and test it on
>> > >> > your servers.
>> > >> >
>> > >> >
>> > >> > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> > >> >
>> > >> > > Do you have many such bags or just one? If one, and you want to
>> > >> > > look up many IPs in it, it might be more efficient to serialize
>> > >> > > this relation to HDFS and write a lookup UDF that specifies the
>> > >> > > serialized data set as a file to put in the distributed cache. At
>> > >> > > init time, load the file into memory, then for every IP do the
>> > >> > > binary search in exec().
>> > >> > >
>> > >> > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
>> > >> > >
>> > >> > > > Thank you all!
>> > >> > > >
>> > >> > > > The details are:
>> > >> > > > A bag contains many "IP segments" whose schema is (ipStart:long,
>> > >> > > > ipEnd:long, locName:chararray). There are about 30000 tuples,
>> > >> > > > and I want to check whether an IP belongs to one of the segments
>> > >> > > > in the bag.
>> > >> > > >
>> > >> > > > I want to order the "IP segments" by (ipStart, ipEnd) in MR,
>> > >> > > > and then binary search in a UDF to check whether an IP is in the bag.
>> > >> > > >
>> > >> > > > If I enumerate every IP, it will be more than 100000000 single IPs,
>> > >> > > > and I think a JOIN in Pig would also be time-consuming.
>> > >> > > >
>> > >> > > > Please help me deal with this efficiently!
>> > >> > > >
>> > >> > > >
>> > >> > > > 2011/12/14 Thejas Nair <[email protected]>
>> > >> > > >
>> > >> > > >> My assumption is that 唐亮 is trying to do binary search on bags
>> > >> > > >> within the tuples in a relation (i.e. the schema of the relation
>> > >> > > >> has a bag column). I don't think he is trying to treat the entire
>> > >> > > >> relation as one bag and do binary search on that.
>> > >> > > >>
>> > >> > > >>
>> > >> > > >> -Thejas
>> > >> > > >>
>> > >> > > >>
>> > >> > > >>
>> > >> > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
>> > >> > > >>
>> > >> > > >>> I don't think this could be done.
>> > >> > > >>>
>> > >> > > >>> Pig is just a Hadoop job, and the idea behind Hadoop is to read
>> > >> > > >>> all the data in a file.
>> > >> > > >>>
>> > >> > > >>> So by the time you put all the data into an array, you would have
>> > >> > > >>> been better off just checking each element for the one you were
>> > >> > > >>> looking for.
>> > >> > > >>>
>> > >> > > >>> So what you would get is [n + lg(n)] (n to fill the array, then
>> > >> > > >>> lg(n) to search it), which is just [n].
>> > >> > > >>> Second, Hadoop is all about large data analysis, usually more
>> > >> > > >>> than 100GB, so putting this into memory is out of the question.
>> > >> > > >>> Third, Hadoop is efficient because it processes this large amount
>> > >> > > >>> of data by splitting it up into multiple processes. To do an
>> > >> > > >>> efficient binary search, you would need to do this in one mapper
>> > >> > > >>> or one reducer.
>> > >> > > >>>
>> > >> > > >>> My opinion is just don't fight Hadoop/Pig.
>> > >> > > >>>
>> > >> > > >>>
>> > >> > > >>>
>> > >> > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:
>> > >> > > >>>
>> > >> > > >>>> Bags can be very large and might not fit into memory, and in such
>> > >> > > >>>> cases some or all of the bag might have to be stored on disk.
>> > >> > > >>>> Random access on the bag is then not efficient, which is why the
>> > >> > > >>>> DataBag interface does not support it.
>> > >> > > >>>>
>> > >> > > >>>> As Prashant suggested, storing it in a tuple would be a good
>> > >> > > >>>> alternative if you want random access for binary search.
>> > >> > > >>>>
>> > >> > > >>>> -Thejas
>> > >> > > >>>>
>> > >> > > >>>>
>> > >> > > >>>>
>> > >> > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
>> > >> > > >>>>
>> > >> > > >>>>> Hi all,
>> > >> > > >>>>> How can I implement a binary search in Pig?
>> > >> > > >>>>>
>> > >> > > >>>>> In one relation, there is a bag whose items are sorted,
>> > >> > > >>>>> and I want to check whether a specific item exists in the bag.
>> > >> > > >>>>>
>> > >> > > >>>>> In a UDF, I can't randomly access items in a DataBag container,
>> > >> > > >>>>> so I have to copy the items from the DataBag into an ArrayList,
>> > >> > > >>>>> which is time-consuming.
>> > >> > > >>>>>
>> > >> > > >>>>> How can I implement the binary search efficiently in Pig?
>> > >> > > >>>>>
