If that list of IP pairs is mostly static and will be used frequently, maybe 
just copy it into HDFS with a high replication factor. Then read it from HDFS 
and use it as a lookup table (a binary tree or TreeMap kind of thing) instead 
of using the distributed cache, if that sounds easier to do.
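
For illustration, a minimal sketch of that idea: a UDF that lazily reads a
tab-separated (ipStart, ipEnd, locName) file from HDFS into a TreeMap and
answers each lookup with floorEntry(). The class name, path, and file format
below are assumptions for the sketch, not anything from this thread:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class IpToLocation extends EvalFunc<String> {

        // Hypothetical HDFS path of a tab-separated (ipStart, ipEnd, locName) file.
        private static final String RANGES_PATH = "/data/ip_segments.tsv";

        // ipStart -> (ipEnd, locName); TreeMap gives us the floorEntry() lookup.
        private TreeMap<Long, Object[]> ranges;

        private void loadRanges() throws IOException {
            ranges = new TreeMap<Long, Object[]>();
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(RANGES_PATH))));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");
                    ranges.put(Long.parseLong(f[0]),
                            new Object[] { Long.parseLong(f[1]), f[2] });
                }
            } finally {
                in.close();
            }
        }

        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null)
                return null;
            if (ranges == null)
                loadRanges();   // read once per task, reuse for every call
            long ip = (Long) input.get(0);
            // Largest ipStart <= ip is the only candidate segment.
            Map.Entry<Long, Object[]> e = ranges.floorEntry(ip);
            if (e != null && ip <= (Long) e.getValue()[0])
                return (String) e.getValue()[1];
            return null;        // no segment contains this IP
        }
    }

Since every task reads the file straight from HDFS, the high replication
factor is what keeps those reads from hammering a few datanodes.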

 
Best regards,
Michael


________________________________
 From: Dmitriy Ryaboy <[email protected]>
To: [email protected] 
Sent: Wednesday, December 14, 2011 10:28 AM
Subject: Re: Implement Binary Search in PIG
 
HBase has nothing to do with the distributed cache.


2011/12/14 唐亮 <[email protected]>

> Right now I am not using HBase, so maybe I can't use the DistributedCache.
>
> And if I FLATTEN the DataBag, the results are Tuples; then in the UDF I can
> process only one Tuple at a time, which can't implement a binary search.
>
> So, please help and show me the detailed solution.
> Thanks!
>
> On Dec 14, 2011, at 5:59 PM, 唐亮 <[email protected]> wrote:
>
> > Hi Prashant Kommireddi,
> >
> > If I do 1. and 2. as you mentioned,
> > the schema will be {tag, ipStart, ipEnd, locName}.
> >
> > BUT, how should I write the UDF, especially how should I set the type of
> > the input parameter?
> >
> > Currently the UDF code is as below; the second field of its input is a DataBag:
> >
> > import java.io.IOException;
> >
> > import org.apache.pig.EvalFunc;
> > import org.apache.pig.data.DataBag;
> > import org.apache.pig.data.Tuple;
> >
> > public class GetProvinceNameFromIPNum extends EvalFunc<String> {
> >
> >     private static final String UnknownIP = "unknown";  // assumed default result
> >
> >     public String exec(Tuple input) throws IOException {
> >         if (input == null || input.size() == 0)
> >             return UnknownIP;
> >         if (input.size() != 2) {
> >             throw new IOException("Expected input's size is 2, but is: "
> >                     + input.size());
> >         }
> >
> >         Object o1 = input.get(0);  // This should be the IP you want to look up
> >         if (!(o1 instanceof Long)) {
> >             throw new IOException("Expected input 1 to be Long, but got "
> >                     + o1.getClass().getName());
> >         }
> >
> >         Object o2 = input.get(1);  // This is the Bag of IP segs
> >         if (!(o2 instanceof DataBag)) {  // Should I change it to "(o2 instanceof Tuple)"?
> >             throw new IOException("Expected input 2 to be DataBag, but got "
> >                     + o2.getClass().getName());
> >         }
> >
> >         // ........... other codes ...........
> >     }
> > }
> >
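> > For reference, a minimal sketch of what the elided search could look like,
> > assuming the second argument has been converted to a Tuple of
> > (ipStart, ipEnd, locName) tuples that is sorted by ipStart with
> > non-overlapping segments (the casts and names are illustrative):
> >
> >         Tuple segs = (Tuple) o2;
> >         long ip = (Long) o1;
> >         int lo = 0, hi = segs.size() - 1;
> >         while (lo <= hi) {
> >             int mid = (lo + hi) >>> 1;
> >             Tuple seg = (Tuple) segs.get(mid);  // (ipStart, ipEnd, locName)
> >             long start = (Long) seg.get(0);
> >             long end = (Long) seg.get(1);
> >             if (ip < start) {
> >                 hi = mid - 1;                   // look in the lower half
> >             } else if (ip > end) {
> >                 lo = mid + 1;                   // look in the upper half
> >             } else {
> >                 return (String) seg.get(2);     // ip falls inside this segment
> >             }
> >         }
> >         return UnknownIP;                       // no segment contains the ip
> >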
> >
> >
> > On Dec 14, 2011, at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:
> >
> >> Seems like at the end of this you have a single bag with all the elements,
> >> and somehow you would like to check whether an element exists in it based
> >> on ipstart/end.
> >>
> >>
> >>   1. Use FLATTEN (http://pig.apache.org/docs/r0.9.1/basic.html#flatten) -
> >>   this will convert the Bag to Tuples: to_tuple = FOREACH order_ip_segs
> >>   GENERATE tag, FLATTEN(order_seq); ---- This is O(n)
> >>   2. Now write a UDF that can access the elements positionally for the
> >>   BinarySearch
> >>   3. Dmitriy's and Jonathan's ideas with the DistributedCache could perform
> >>   better than the above approach, so you could go down that route too.
> >>
> >>
> >> 2011/12/13 唐亮 <[email protected]>
> >>
> >> > The detailed PIG code is as below:
> >> >
> >> > raw_ip_segment = load ...
> >> > ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
> >> > group_ip_segs = group ip_segs all;
> >> >
> >> > order_ip_segs = foreach group_ip_segs {
> >> >  order_seg = order ip_segs by ipstart, ipend;
> >> >  generate 't' as tag, order_seg;
> >> > }
> >> > describe order_ip_segs;
> >> > order_ip_segs: {tag: chararray, order_seg: {ipstart: long, ipend: long,
> >> > poid: chararray}}
> >> >
> >> > Here, the order_ip_segs::order_seg is a BAG;
> >> > how can I transfer it to a TUPLE?
> >> >
> >> > And can I access the TUPLE randomly in a UDF?
> >> >
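> >> > For what it's worth, one way to do it inside a UDF is a single O(n) pass
> >> > that un-nests the bag into a flat Tuple, which then supports random
> >> > access. A sketch, assuming org.apache.pig.data.TupleFactory and
> >> > illustrative names:
> >> >
> >> >         DataBag bag = (DataBag) input.get(1);
> >> >         List<Object> items = new ArrayList<Object>((int) bag.size());
> >> >         for (Tuple t : bag) {    // one sequential pass over the bag
> >> >             items.add(t);
> >> >         }
> >> >         Tuple flat = TupleFactory.getInstance().newTuple(items);
> >> >         // flat.get(i) is now random access and can back a binary search
> >> >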
> >> > On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:
> >> >
> >> > > Then how can I transfer all the items in the Bag to a Tuple?
> >> > >
> >> > >
> >> > > 2011/12/14 Jonathan Coveney <[email protected]>
> >> > >
> >> > >> It's funny, but if you look wayyyy in the past, I actually asked a
> >> > >> bunch of questions that circled around, literally, this exact problem.
> >> > >>
> >> > >> Dmitriy and Prashant are correct: the best way is to make a UDF that
> >> > >> can do the lookup really efficiently. This is what the maxmind API
> >> > >> does, for example.
> >> > >>
> >> > >> 2011/12/13 Prashant Kommireddi <[email protected]>
> >> > >>
> >> > >> > I am lost when you say "If enumerate every IP, it will be more
> >> > >> > than 100000000 single IPs".
> >> > >> >
> >> > >> > If each bag is a collection of 30000 tuples, it might not be too
> >> > >> > bad on the memory if you used a Tuple to store the segments instead?
> >> > >> >
> >> > >> > (8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes.
> >> > >> > Let's say we incur an additional overhead of 4X this, which is ~160
> >> > >> > bytes per tuple.
> >> > >> > Total per Bag = 30000 X 160 bytes = ~5 MB.
> >> > >> >
> >> > >> > You could probably store the ip segments as a Tuple and test it on
> >> > >> > your servers.
> >> > >> >
> >> > >> >
> >> > >> > On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy
> >> > >> > <[email protected]> wrote:
> >> > >> >
> >> > >> > > Do you have many such bags or just one? If one, and you want to
> >> > >> > > look up many IPs in it, it might be more efficient to serialize
> >> > >> > > this relation to hdfs, and write a lookup udf that specifies the
> >> > >> > > serialized data set as a file to put in the distributed cache.
> >> > >> > > At init time, load the file into memory, then for every ip do
> >> > >> > > the binary search in exec().
> >> > >> > >
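> >> > >> > > A minimal sketch of that shape, assuming the relation has been
> >> > >> > > stored to HDFS as a tab-separated file sorted by ipstart, and
> >> > >> > > assuming a Pig version whose EvalFunc exposes getCacheFiles()
> >> > >> > > (the path and the '#ipsegs' alias are illustrative):
> >> > >> > >
> >> > >> > >     import java.io.BufferedReader;
> >> > >> > >     import java.io.FileReader;
> >> > >> > >     import java.io.IOException;
> >> > >> > >     import java.util.ArrayList;
> >> > >> > >     import java.util.Arrays;
> >> > >> > >     import java.util.List;
> >> > >> > >
> >> > >> > >     import org.apache.pig.EvalFunc;
> >> > >> > >     import org.apache.pig.data.Tuple;
> >> > >> > >
> >> > >> > >     public class IpSegLookup extends EvalFunc<String> {
> >> > >> > >
> >> > >> > >         private long[] starts;    // ipStart values, already sorted
> >> > >> > >         private long[] ends;
> >> > >> > >         private String[] names;
> >> > >> > >
> >> > >> > >         // Ask Pig to ship the serialized relation via the distributed
> >> > >> > >         // cache; it shows up locally under the alias after the '#'.
> >> > >> > >         public List<String> getCacheFiles() {
> >> > >> > >             return Arrays.asList("/data/ip_segments.tsv#ipsegs");
> >> > >> > >         }
> >> > >> > >
> >> > >> > >         private void init() throws IOException {
> >> > >> > >             List<String> lines = new ArrayList<String>();
> >> > >> > >             BufferedReader in = new BufferedReader(new FileReader("./ipsegs"));
> >> > >> > >             String line;
> >> > >> > >             while ((line = in.readLine()) != null)
> >> > >> > >                 lines.add(line);
> >> > >> > >             in.close();
> >> > >> > >             starts = new long[lines.size()];
> >> > >> > >             ends = new long[lines.size()];
> >> > >> > >             names = new String[lines.size()];
> >> > >> > >             for (int i = 0; i < lines.size(); i++) {
> >> > >> > >                 String[] f = lines.get(i).split("\t");
> >> > >> > >                 starts[i] = Long.parseLong(f[0]);
> >> > >> > >                 ends[i] = Long.parseLong(f[1]);
> >> > >> > >                 names[i] = f[2];
> >> > >> > >             }
> >> > >> > >         }
> >> > >> > >
> >> > >> > >         public String exec(Tuple input) throws IOException {
> >> > >> > >             if (input == null || input.size() == 0 || input.get(0) == null)
> >> > >> > >                 return null;
> >> > >> > >             if (starts == null)
> >> > >> > >                 init();    // load the cached file once per task
> >> > >> > >             long ip = (Long) input.get(0);
> >> > >> > >             int i = Arrays.binarySearch(starts, ip);
> >> > >> > >             if (i < 0)
> >> > >> > >                 i = -i - 2;    // index of the largest ipStart <= ip
> >> > >> > >             if (i >= 0 && ip <= ends[i])
> >> > >> > >                 return names[i];
> >> > >> > >             return null;       // no segment contains this ip
> >> > >> > >         }
> >> > >> > >     }
> >> > >> > >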
> >> > >> > > On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:
> >> > >> > >
> >> > >> > > > Thank you all!
> >> > >> > > >
> >> > >> > > > The detail is:
> >> > >> > > > A bag contains many "IP Segments", whose schema is
> >> > >> > > > (ipStart:long, ipEnd:long, locName:chararray); the number of
> >> > >> > > > tuples is about 30000, and I want to check whether an IP
> >> > >> > > > belongs to one of the segments in the bag.
> >> > >> > > >
> >> > >> > > > I want to order the "IP Segments" by (ipStart, ipEnd) in MR,
> >> > >> > > > and then binary-search whether an IP is in the bag in a UDF.
> >> > >> > > >
> >> > >> > > > If I enumerate every IP, there will be more than 100000000
> >> > >> > > > single IPs, so I think a JOIN in PIG will also be time-consuming.
> >> > >> > > >
> >> > >> > > > Please help me deal with this efficiently!
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > 2011/12/14 Thejas Nair <[email protected]>
> >> > >> > > >
> >> > >> > > >> My assumption is that 唐亮 is trying to do binary search on
> >> > >> > > >> bags within the tuples in a relation (ie schema of the
> >> > >> > > >> relation has a bag column). I don't think he is trying to
> >> > >> > > >> treat the entire relation as one bag and do binary search on
> >> > >> > > >> that.
> >> > >> > > >>
> >> > >> > > >>
> >> > >> > > >> -Thejas
> >> > >> > > >>
> >> > >> > > >>
> >> > >> > > >>
> >> > >> > > >> On 12/13/11 2:30 PM, Andrew Wells wrote:
> >> > >> > > >>
> >> > >> > > >>> I don't think this can be done.
> >> > >> > > >>>
> >> > >> > > >>> Pig is just a hadoop job, and the idea behind hadoop is to
> >> > >> > > >>> read all the data in a file.
> >> > >> > > >>>
> >> > >> > > >>> So by the time you have put all the data into an array, you
> >> > >> > > >>> would have been better off just checking each element for
> >> > >> > > >>> the one you were looking for. What you would get is
> >> > >> > > >>> [n + lg(n)], which is just [n], after putting that into an
> >> > >> > > >>> array.
> >> > >> > > >>>
> >> > >> > > >>> Second, hadoop is all about large data analysis, usually
> >> > >> > > >>> more than 100GB, so putting this into memory is out of the
> >> > >> > > >>> question.
> >> > >> > > >>>
> >> > >> > > >>> Third, hadoop is efficient because it processes this large
> >> > >> > > >>> amount of data by splitting it up into multiple processes.
> >> > >> > > >>> To do an efficient binary search, you would need to do this
> >> > >> > > >>> in one mapper or one reducer.
> >> > >> > > >>>
> >> > >> > > >>> My opinion is: just don't fight hadoop/pig.
> >> > >> > > >>>
> >> > >> > > >>>
> >> > >> > > >>>
> >> > >> > > >>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair
> >> > >> > > >>> <[email protected]> wrote:
> >> > >> > > >>>
> >> > >> > > >>>> Bags can be very large and might not fit into memory, and
> >> > >> > > >>>> in such cases some or all of the bag might have to be
> >> > >> > > >>>> stored on disk. In such cases, it is not efficient to do
> >> > >> > > >>>> random access on the bag. That is why the DataBag interface
> >> > >> > > >>>> does not support it.
> >> > >> > > >>>>
> >> > >> > > >>>> As Prashant suggested, storing it in a tuple would be a
> >> > >> > > >>>> good alternative, if you want to have random access to do
> >> > >> > > >>>> binary search.
> >> > >> > > >>>>
> >> > >> > > >>>> -Thejas
> >> > >> > > >>>>
> >> > >> > > >>>>
> >> > >> > > >>>>
> >> > >> > > >>>> On 12/12/11 7:54 PM, 唐亮 wrote:
> >> > >> > > >>>>
> >> > >> > > >>>>> Hi all,
> >> > >> > > >>>>> How can I implement a binary search in pig?
> >> > >> > > >>>>>
> >> > >> > > >>>>> In one relation there is a bag whose items are sorted,
> >> > >> > > >>>>> and I want to check whether a specific item exists in the
> >> > >> > > >>>>> bag.
> >> > >> > > >>>>>
> >> > >> > > >>>>> In a UDF, I can't randomly access items in the DataBag
> >> > >> > > >>>>> container, so I have to copy the items from the DataBag
> >> > >> > > >>>>> into an ArrayList, and this is time-consuming.
> >> > >> > > >>>>>
> >> > >> > > >>>>> How can I implement the binary search efficiently in pig?
> >> > >> > > >>>>>
> >> > >> > > >>>>>
> >> > >> > > >>>>>
> >> > >> > > >>>>
> >> > >> > > >>>
> >> > >> > > >>
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
