Now the question is: How should I put all the "IP Segments" in one TUPLE?
Please help me!

2011/12/15 Prashant Kommireddi <[email protected]>:

Michael,

This would have no benefit over using a DistributedCache. For a large cluster this would mean poor performance. If the file is static and needs to be looked up across the cluster, DistributedCache would be a better approach.

Thanks,
Prashant

On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <[email protected]> wrote:

If that list of IP pairs is fairly static and will be used frequently, maybe just copy it into HDFS with a high replication factor. Then use it as a lookup table, or as some binary tree or TreeMap kind of structure, by reading it from HDFS instead of using the distributed cache, if that sounds like an easier thing to do.

Best regards,
Michael

________________________________
From: Dmitriy Ryaboy <[email protected]>
To: [email protected]
Sent: Wednesday, December 14, 2011 10:28 AM
Subject: Re: Implement Binary Search in PIG

HBase has nothing to do with the distributed cache.

2011/12/14 唐亮 <[email protected]>:

For now I am not using HBase, so maybe I can't use DistributedCache.

And if I FLATTEN the DataBag, the results are Tuples; then in the UDF I can process only one Tuple at a time, which can't implement binary search.

So please help and show me a detailed solution. Thanks!

On Dec 14, 2011, at 5:59 PM, 唐亮 <[email protected]> wrote:

Hi Prashant Kommireddi,

If I do 1. and 2. as you mentioned, the schema will be {tag, ipStart, ipEnd, locName}.

BUT how should I write the UDF, and especially, how should I set the type of the input parameter?
Currently, the UDF code is as below; its input parameter is a DataBag:

public class GetProvinceNameFromIPNum extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return UnknownIP;
        if (input.size() != 2) {
            throw new IOException("Expected input's size is 2, but is: "
                    + input.size());
        }

        Object o1 = input.get(0); // This should be the IP you want to look up
        if (!(o1 instanceof Long)) {
            throw new IOException("Expected input 1 to be Long, but got "
                    + o1.getClass().getName());
        }
        Object o2 = input.get(1); // This is the Bag of IP segs
        if (!(o2 instanceof DataBag)) { // Should I change it to "(o2 instanceof Tuple)"?
            throw new IOException("Expected input 2 to be DataBag, but got "
                    + o2.getClass().getName());
        }

        // ........... other code ...........
    }

}

On Dec 14, 2011, at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:

It seems like at the end of this you have a single bag with all the elements, and you would like to check whether an element exists in it based on ipstart/ipend.

1. Use FLATTEN (http://pig.apache.org/docs/r0.9.1/basic.html#flatten); this will convert the Bag to a Tuple: to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seq); this is O(n).
2. Now write a UDF that can access the elements positionally for the binary search.
3. Dmitriy's and Jonathan's ideas with DistributedCache could perform better than the above approach, so you could go down that route too.
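Step 2 above boils down to a range binary search over segments sorted by (ipStart, ipEnd). Here is a minimal plain-Java sketch, assuming the segments are non-overlapping and already sorted; the class and method names are illustrative, and the Pig plumbing (unpacking the Tuple into these arrays) is omitted:

```java
// Illustrative sketch: binary search over IP ranges sorted by ipStart.
// Assumes segments do not overlap, so each IP falls in at most one range.
public class IpRangeLookup {
    private final long[] starts;
    private final long[] ends;
    private final String[] names;

    public IpRangeLookup(long[] starts, long[] ends, String[] names) {
        this.starts = starts; // must already be sorted ascending by start
        this.ends = ends;
        this.names = names;
    }

    /** Returns the location name for ip, or null if no segment contains it. */
    public String lookup(long ip) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) {
                hi = mid - 1;
            } else if (ip > ends[mid]) {
                lo = mid + 1;
            } else {
                return names[mid]; // starts[mid] <= ip <= ends[mid]
            }
        }
        return null;
    }
}
```

In the UDF's exec(), the three arrays would be filled from the flattened tuple's fields before calling lookup() for each incoming IP.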
2011/12/13 唐亮 <[email protected]>:

The detailed Pig code is as below:

raw_ip_segment = load ...
ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
group_ip_segs = group ip_segs all;

order_ip_segs = foreach group_ip_segs {
    order_seg = order ip_segs by ipstart, ipend;
    generate 't' as tag, order_seg;
}

describe order_ip_segs
order_ip_segs: {tag: chararray, order_seg: {ipstart: long, ipend: long, poid: chararray}}

Here, order_ip_segs::order_seg is a BAG; how can I transfer it to a TUPLE?

And can I access the TUPLE randomly in a UDF?

On Dec 14, 2011, at 2:41 PM, 唐亮 <[email protected]> wrote:

Then how can I transfer all the items in the Bag to a Tuple?

2011/12/14 Jonathan Coveney <[email protected]>:

It's funny, but if you look way back in the past, I actually asked a bunch of questions that circled around literally this exact problem.

Dmitriy and Prashant are correct: the best way is to make a UDF that can do the lookup really efficiently. This is what the MaxMind API does, for example.

2011/12/13 Prashant Kommireddi <[email protected]>:

I am lost when you say "If enumerate every IP, it will be more than 100000000 single IPs".

If each bag is a collection of 30000 tuples, it might not be too bad on memory if you used a Tuple to store the segments instead.
(8 bytes long + 8 bytes long + 20 bytes for chararray) = 36 bytes.
Let's say we incur an additional overhead of roughly 4X this, which is ~160 bytes per tuple.
Total per bag = 30000 x 160 bytes = ~5 MB.

You could probably store the IP segments as a Tuple and test it on your servers.

On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

Do you have many such bags or just one? If one, and you want to look up many IPs in it, it might be more efficient to serialize this relation to HDFS and write a lookup UDF that specifies the serialized data set as a file to put in the distributed cache. At init time, load the file into memory; then, for every IP, do the binary search in exec().

On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]> wrote:

Thank you all!

The details are: a bag contains many "IP segments", whose schema is (ipStart:long, ipEnd:long, locName:chararray). The number of tuples is about 30000, and I want to check whether an IP belongs to one of the segments in the bag.

I want to order the "IP segments" by (ipStart, ipEnd) in MR, and then binary-search in a UDF whether an IP is in the bag.
If I enumerate every IP, there will be more than 100000000 single IPs; I think it would also be time-consuming to do this with a JOIN in Pig.

Please help me deal with this efficiently!

2011/12/14 Thejas Nair <[email protected]>:

My assumption is that 唐亮 is trying to do a binary search on bags within the tuples of a relation (i.e., the schema of the relation has a bag column). I don't think he is trying to treat the entire relation as one bag and do a binary search on that.

-Thejas

On 12/13/11 2:30 PM, Andrew Wells wrote:

I don't think this can be done.

Pig is just a Hadoop job, and the idea behind Hadoop is to read all the data in a file.

So by the time you put all the data into an array, you would have been better off just checking each element for the one you were looking for. What you would get is [n + lg(n)], which will just be [n], after putting the data into an array.

Second, Hadoop is all about large data analysis, usually more than 100 GB, so putting this into memory is out of the question.
Third, Hadoop is efficient because it processes this large amount of data by splitting it up into multiple processes. To do an efficient binary search, you would need to do it in one mapper or one reducer.

My opinion is: just don't fight Hadoop/Pig.

On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair <[email protected]> wrote:

Bags can be very large and might not fit into memory; in such cases, some or all of the bag might have to be stored on disk, and it is not efficient to do random access on the bag. That is why the DataBag interface does not support it.

As Prashant suggested, storing the data in a tuple would be a good alternative if you want random access to do the binary search.

-Thejas

On 12/12/11 7:54 PM, 唐亮 wrote:

Hi all,

How can I implement a binary search in Pig?

In one relation there is a bag whose items are sorted, and I want to check whether a specific item exists in the bag.

In a UDF, I can't randomly access the items in a DataBag container.
So I have to transfer the items in the DataBag to an ArrayList, and this is time-consuming.

How can I implement the binary search efficiently in Pig?
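The lookup UDF that Dmitriy describes earlier in this thread has two phases: parse and sort the serialized segment file once at init time, then binary-search per record in exec(). Below is a minimal plain-Java sketch of such a table; the "ipStart,ipEnd,locName" line format and all names here are assumptions, and the DistributedCache/Pig plumbing is omitted:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: build a sorted in-memory table once from the
// serialized segment lines, then answer each query with a binary search.
public class SegmentTable {
    private final long[] starts;
    private final long[] ends;
    private final String[] names;

    // Each line is assumed to look like "ipStart,ipEnd,locName".
    public SegmentTable(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(","));
        }
        // Sort once at load time so every later lookup is O(log n).
        rows.sort(Comparator.comparingLong((String[] r) -> Long.parseLong(r[0])));
        int n = rows.size();
        starts = new long[n];
        ends = new long[n];
        names = new String[n];
        for (int i = 0; i < n; i++) {
            starts[i] = Long.parseLong(rows.get(i)[0]);
            ends[i] = Long.parseLong(rows.get(i)[1]);
            names[i] = rows.get(i)[2];
        }
    }

    // Binary search over [ipStart, ipEnd] ranges; assumes non-overlapping segments.
    public String lookup(long ip) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ip < starts[mid]) hi = mid - 1;
            else if (ip > ends[mid]) lo = mid + 1;
            else return names[mid];
        }
        return null;
    }
}
```

A UDF following this pattern would construct one SegmentTable the first time exec() is called, reading the cached file line by line, and reuse it for every subsequent record.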
