Re: Implement Binary Search in PIG

Dmitriy Ryaboy Sun, 18 Dec 2011 20:56:02 -0800

There's a very detailed write-up about this here:
http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html



2011/12/18 唐亮 <[email protected]>:
> Prashant Kommireddi，
> Thank you very much!
> And your code seems cool, especially the usage of  '*'.
>
> But, I'm still not very sure about the details.
>
> My PIG scripts are as below:
> -- Load IP Segments
> raw_ip_segment = load ...
> ip_segs = foreach raw_ip_segment generate ipstart, ipend, name;
> group_ip_segs = group ip_segs all;
>
> order_ip_segs = foreach group_ip_segs {
>   order_seg = order ip_segs by ipstart, ipend;
>   generate 't' as tag, order_seg;
> }
> describe order_ip_segs
> order_ip_segs: {tag: chararray,order_seg: {ipstart: long,ipend: long,poid: chararray}}
>
> -- Load IP from LOG
> ip_log = load ...
> ip_tag = foreach ip_log generate 't' as tag, ip;
>
> -- Join by tag
> join_ip_tag = join order_ip_segs by tag, ip_tag by tag;
>
> retain_ip_segs = foreach join_ip_tag generate ip_tag::ip as ip,
>     order_ip_segs::order_seg as order_seg;
> -- ip: the ip I want to look up;
> -- order_seg: ordered ip segments used for BinarySearch
>
>
> Can you show me the remaining steps in detail,
> such as the code of the UDF and the PIG script that calls the UDF?
>
>
> On December 18, 2011 at 6:17 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg);
>>
>> result = FOREACH to_tuple GENERATE GetProvinceNameFromIPNum(toSearch, *);
>>
>>
>> 2011/12/18 唐亮 <[email protected]>
>>
>> > Prashant Kommireddi，
>> > How to call your UDF in PIG script?
>> >
>> > Thanks!
>> >
>> > On December 16, 2011 at 1:12 PM, 唐亮 <[email protected]> wrote:
>> >
>> > > Thanks Prashant Kommireddi，
>> > >
>> > > But my question is:
>> > > How to call the UDF in PIG, especially the parameters to put into the
>> > UDF.
>> > >
>> > > On December 15, 2011 at 4:05 PM, Prashant Kommireddi <[email protected]> wrote:
>> > >
>> > > Not sure what you mean. Have you tried the code I forwarded? Are you
>> > facing
>> > >> any issues there?
>> > >>
>> > >> If your question is regarding binarySearch implementation, here is
>> > >> pseudo-code'ish implementation. I have not tested this, please treat
>> > this
>> > >> as a general idea on how to go about accessing the elements within the
>> > >> Tuple.
>> > >>
>> > >> ALSO, I am assuming you have defined schema for (inner) Tuple
>> contents.
>> > >>
>> > >> public String binarySearch(Tuple tuple, long toSearch, int low, int high)
>> > >>     throws ExecException   //Tuple.get() can throw this
>> > >> {
>> > >>  if(low > high)
>> > >>     return "NOT FOUND";    //Handle this the way you would like
>> > >>
>> > >>  if(tuple == null)
>> > >>    throw new IllegalArgumentException("Tuple is null");   //Handle this the way you would like
>> > >>
>> > >>  int mid = low + (high - low)/2;   //Avoids int overflow of (low + high)
>> > >>  Tuple midTuple = (Tuple) tuple.get(mid);   //get() returns Object, so cast
>> > >>  String tag = midTuple.get(0).toString();
>> > >>  long ipstart = (Long)midTuple.get(1);
>> > >>  long ipend = (Long)midTuple.get(2);
>> > >>  String loc = midTuple.get(3).toString();
>> > >>
>> > >>  if(toSearch == ipstart)  //Or ipend, I am not sure how you want to search
>> > >>  {
>> > >>    return loc;
>> > >>  }
>> > >>  else if(toSearch < ipstart)
>> > >>    return binarySearch(tuple, toSearch, low, mid - 1);
>> > >>
>> > >>  else
>> > >>    return binarySearch(tuple, toSearch, mid+1, high);
>> > >>
>> > >>  }
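
The pseudo-code above matches only when toSearch equals ipstart, while the question in this thread is whether an IP falls inside a segment. A minimal sketch of a range-aware search follows, in plain Java with no Pig dependency. The class names IpSegment and SegmentSearch are hypothetical, and the sketch assumes the segments are sorted by start and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one row of the sorted segment list.
final class IpSegment {
    final long start;   // inclusive lower bound
    final long end;     // inclusive upper bound
    final String name;

    IpSegment(long start, long end, String name) {
        this.start = start;
        this.end = end;
        this.name = name;
    }
}

final class SegmentSearch {
    // Returns the name of the segment containing ip, or null if none does.
    // Assumes segments are sorted by start and do not overlap.
    static String find(List<IpSegment> segments, long ip) {
        int low = 0;
        int high = segments.size() - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;   // avoids int overflow of low + high
            IpSegment s = segments.get(mid);
            if (ip < s.start) {
                high = mid - 1;                 // ip lies before this segment
            } else if (ip > s.end) {
                low = mid + 1;                  // ip lies after this segment
            } else {
                return s.name;                  // start <= ip <= end
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<IpSegment> segs = new ArrayList<>();
        segs.add(new IpSegment(10L, 19L, "A"));
        segs.add(new IpSegment(30L, 39L, "B"));
        System.out.println(find(segs, 35L));  // inside the second segment
        System.out.println(find(segs, 25L));  // falls in the gap between segments
    }
}
```

Inside a Pig UDF the same loop would read ipstart and ipend from the inner tuples instead of from IpSegment fields.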
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> 2011/12/14 唐亮 <[email protected]>
>> > >>
>> > >> > Hi Prashant Kommireddi，
>> > >> >
>> > >> > If so, how should I write the UDF, especially the data types in UDF?
>> > >> >
>> > >> > 2011/12/15 Prashant Kommireddi <[email protected]>
>> > >> >
>> > >> > > When you flatten your BAG all your segments are within a single
>> > tuple.
>> > >> > > Something like
>> > >> > >
>> > >> > > ((tag, ipstart, ipend, loc), (tag, ipstart, ipend, loc)...(tagN,
>> > >> > > ipstartN, ipendN, locN))
>> > >> > >
>> > >> > > You can access the inner tuples positionally.
>> > >> > >
>> > >> > > Sent from my iPhone
>> > >> > >
>> > >> > > On Dec 14, 2011, at 6:28 PM, "唐亮" <[email protected]> wrote:
>> > >> > >
>> > >> > > > Now the question is:
>> > >> > > > How should I put all the "IP Segments" in one TUPLE?
>> > >> > > >
>> > >> > > > Please help me!
>> > >> > > >
>> > >> > > >
>> > >> > > > 2011/12/15 Prashant Kommireddi <[email protected]>
>> > >> > > >
>> > >> > > >> Michael,
>> > >> > > >>
>> > >> > > >> This would have no benefit over using a DistributedCache. For a
>> > >> large
>> > >> > > >> cluster this would mean poor performance. If the file is static
>> > and
>> > >> > > needs
>> > >> > > >> to be looked-up across the cluster, DistributedCache would be a
>> > >> better
>> > >> > > >> approach.
>> > >> > > >>
>> > >> > > >> Thanks,
>> > >> > > >> Prashant
>> > >> > > >>
>> > >> > > >> On Wed, Dec 14, 2011 at 11:18 AM, jiang licht <
>> > >> [email protected]>
>> > >> > > >> wrote:
>> > >> > > >>
>> > >> > > >>> If that list of ip pairs is pretty static most time and will
>> be
>> > >> used
>> > >> > > >>> frequently, maybe just copy it in hdfs with a high replication
>> > >> > factor.
>> > >> > > >> Then
>> > >> > > >>> use it as a look up table or some binary tree or treemap kind
>> of
>> > >> > thing
>> > >> > > by
>> > >> > > >>> reading it from hdfs instead of using distributed cache if
>> that
>> > >> > sounds
>> > >> > > an
>> > >> > > >>> easier thing to do.
>> > >> > > >>>
>> > >> > > >>>
>> > >> > > >>> Best regards,
>> > >> > > >>> Michael
>> > >> > > >>>
>> > >> > > >>>
>> > >> > > >>> ________________________________
>> > >> > > >>> From: Dmitriy Ryaboy <[email protected]>
>> > >> > > >>> To: [email protected]
>> > >> > > >>> Sent: Wednesday, December 14, 2011 10:28 AM
>> > >> > > >>> Subject: Re: Implement Binary Search in PIG
>> > >> > > >>>
>> > >> > > >>> hbase has nothing to do with distributed cache.
>> > >> > > >>>
>> > >> > > >>>
>> > >> > > >>> 2011/12/14 唐亮 <[email protected]>
>> > >> > > >>>
>> > >> > > >>>> Now, I didn't use HBase,
>> > >> > > >>>> so, maybe I can't use DistributedCache.
>> > >> > > >>>>
>> > >> > > >>>> And if FLATTEN DataBag, the results are Tuples,
>> > >> > > >>>> then in UDF I can process only one Tuple, which can't
>> implement
>> > >> > > >>>> BinarySearch.
>> > >> > > >>>>
>> > >> > > >>>> So, please help and show me the detailed solution.
>> > >> > > >>>> Thanks!
>> > >> > > >>>>
>> > >> > > >>>> On December 14, 2011 at 5:59 PM, 唐亮 <[email protected]> wrote:
>> > >> > > >>>>
>> > >> > > >>>>> Hi Prashant Kommireddi，
>> > >> > > >>>>>
>> > >> > > >>>>> If I do 1. and 2. as you mentioned，
>> > >> > > >>>>> the schema will be {tag, ipStart, ipEnd, locName}.
>> > >> > > >>>>>
>> > >> > > >>>>> BUT, how should I write the UDF, especially how should I set
>> > the
>> > >> > type
>> > >> > > >>> of
>> > >> > > >>>>> the input parameter?
>> > >> > > >>>>>
>> > >> > > >>>>> Currently, the UDF codes are as below, whose input parameter
>> > is
>> > >> > > >>> DataBag:
>> > >> > > >>>>>
>> > >> > > >>>>> public class GetProvinceNameFromIPNum extends EvalFunc<String> {
>> > >> > > >>>>>
>> > >> > > >>>>>   public String exec(Tuple input) throws IOException {
>> > >> > > >>>>>     if (input == null || input.size() == 0)
>> > >> > > >>>>>       return UnknownIP;
>> > >> > > >>>>>     if (input.size() != 2) {
>> > >> > > >>>>>       throw new IOException("Expected input's size is 2, but is: " + input.size());
>> > >> > > >>>>>     }
>> > >> > > >>>>>
>> > >> > > >>>>>     Object o1 = input.get(0);  // This should be the IP you want to look up
>> > >> > > >>>>>     if (!(o1 instanceof Long)) {
>> > >> > > >>>>>       throw new IOException("Expected input 1 to be Long, but got "
>> > >> > > >>>>>           + o1.getClass().getName());
>> > >> > > >>>>>     }
>> > >> > > >>>>>     Object o2 = input.get(1);  // This is the Bag of IP segs
>> > >> > > >>>>>     if (!(o2 instanceof DataBag)) {  // Should I change it to "(o2 instanceof Tuple)"?
>> > >> > > >>>>>       throw new IOException("Expected input 2 to be DataBag, but got "
>> > >> > > >>>>>           + o2.getClass().getName());
>> > >> > > >>>>>     }
>> > >> > > >>>>>
>> > >> > > >>>>>     ........... other codes ...........
>> > >> > > >>>>>   }
>> > >> > > >>>>>
>> > >> > > >>>>> }
>> > >> > > >>>>>
>> > >> > > >>>>>
>> > >> > > >>>>>
>> > >> > > >>>>> On December 14, 2011 at 3:16 PM, Prashant Kommireddi <[email protected]> wrote:
>> > >> > > >>>>>
>> > >> > > >>>>> Seems like at the end of this you have a Single bag with all
>> > the
>> > >> > > >>>> elements,
>> > >> > > >>>>>> and somehow you would like to check whether an element
>> exists
>> > >> in
>> > >> > it
>> > >> > > >>>> based
>> > >> > > >>>>>> on ipstart/end.
>> > >> > > >>>>>>
>> > >> > > >>>>>>
>> > >> > > >>>>>>  1. Use FLATTEN http://pig.apache.org/docs/r0.9.1/basic.html#flatten -
>> > >> > > >>>>>>     this will convert the Bag to Tuple:
>> > >> > > >>>>>>       to_tuple = FOREACH order_ip_segs GENERATE tag, FLATTEN(order_seg);
>> > >> > > >>>>>>     This is O(n).
>> > >> > > >>>>>>  2. Now write a UDF that can access the elements positionally for the
>> > >> > > >>>>>>     BinarySearch.
>> > >> > > >>>>>>  3. Dmitriy and Jonathan's ideas with DistributedCache could perform
>> > >> > > >>>>>>     better than the above approach, so you could go down that route too.
>> > >> > > >>>>>>
>> > >> > > >>>>>>
>> > >> > > >>>>>> 2011/12/13 唐亮 <[email protected]>
>> > >> > > >>>>>>
>> > >> > > >>>>>>> The detailed PIG codes are as below:
>> > >> > > >>>>>>>
>> > >> > > >>>>>>> raw_ip_segment = load ...
>> > >> > > >>>>>>> ip_segs = foreach raw_ip_segment generate ipstart, ipend,
>> > >> name;
>> > >> > > >>>>>>> group_ip_segs = group ip_segs all;
>> > >> > > >>>>>>>
>> > >> > > >>>>>>> order_ip_segs = foreach group_ip_segs {
>> > >> > > >>>>>>> order_seg = order ip_segs by ipstart, ipend;
>> > >> > > >>>>>>> generate 't' as tag, order_seg;
>> > >> > > >>>>>>> }
>> > >> > > >>>>>>> describe order_ip_segs
>> > >> > > >>>>>>> order_ip_segs: {tag: chararray,order_seg: {ipstart:
>> > >> long,ipend:
>> > >> > > >>>>>> long,poid:
>> > >> > > >>>>>>> chararray}}
>> > >> > > >>>>>>>
>> > >> > > >>>>>>> Here, the order_ip_segs::order_seg is a BAG,
>> > >> > > >>>>>>> how can I transfer it to a TUPLE?
>> > >> > > >>>>>>>
>> > >> > > >>>>>>> And can I access the TUPLE randomly in UDF?
>> > >> > > >>>>>>>
>> > >> > > >>>>>>> On December 14, 2011 at 2:41 PM, 唐亮 <[email protected]> wrote:
>> > >> > > >>>>>>>
>> > >> > > >>>>>>>> Then how can I transfer all the items in Bag to a Tuple?
>> > >> > > >>>>>>>>
>> > >> > > >>>>>>>>
>> > >> > > >>>>>>>> 2011/12/14 Jonathan Coveney <[email protected]>
>> > >> > > >>>>>>>>
>> > >> > > >>>>>>>>> It's funny, but if you look wayyyy in the past, I
>> actually
>> > >> > > >> asked
>> > >> > > >>> a
>> > >> > > >>>>>> bunch
>> > >> > > >>>>>>>>> of
>> > >> > > >>>>>>>>> questions that circled around, literally, this exact
>> > >> problem.
>> > >> > > >>>>>>>>>
>> > >> > > >>>>>>>>> Dmitriy and Prashant are correct: the best way is to
>> make
>> > a
>> > >> UDF
>> > >> > > >>>> that
>> > >> > > >>>>>> can
>> > >> > > >>>>>>>>> do
>> > >> > > >>>>>>>>> the lookup really efficiently. This is what the maxmind
>> > API
>> > >> > > >> does,
>> > >> > > >>>> for
>> > >> > > >>>>>>>>> example.
>> > >> > > >>>>>>>>>
>> > >> > > >>>>>>>>> 2011/12/13 Prashant Kommireddi <[email protected]>
>> > >> > > >>>>>>>>>
>> > >> > > >>>>>>>>>> I am lost when you say "If enumerate every IP, it will
>> be
>> > >> > > >> more
>> > >> > > >>>> than
>> > >> > > >>>>>>>>>> 100000000 single IPs"
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>> If each bag is a collection of 30000 tuples it might
>> not
>> > be
>> > >> > > >> too
>> > >> > > >>>>>> bad on
>> > >> > > >>>>>>>>> the
>> > >> > > >>>>>>>>>> memory if you used Tuple to store segments instead?
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>> (8 bytes long + 8 bytes long + 20 bytes for chararray
>> ) =
>> > >> 36
>> > >> > > >>>>>>>>>> Let's say we incur an additional overhead of 4X this,
>> > >> which
>> > >> > > >>> is
>> > >> > > >>>>>> ~160
>> > >> > > >>>>>>>>> bytes
>> > >> > > >>>>>>>>>> per tuple.
>> > >> > > >>>>>>>>>> Total per Bag = 30000 X 160 = ~5 MB
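
The estimate above can be checked with a few lines of Java. This is only a sketch of the arithmetic in the message; the class name SegmentMemoryEstimate is hypothetical, and the 20-byte chararray size and 4X overhead factor are the message's own assumptions, not measured values.

```java
final class SegmentMemoryEstimate {
    // Rough total bytes for `tuples` tuples at `bytesPerTuple` each.
    static long totalBytes(int tuples, long bytesPerTuple) {
        return (long) tuples * bytesPerTuple;
    }

    public static void main(String[] args) {
        long raw = 8 + 8 + 20;            // two longs plus ~20 bytes of chararray = 36
        long withOverhead = raw * 4;      // 144, which the message rounds up to ~160
        System.out.println(totalBytes(30000, withOverhead));  // 4320000
        System.out.println(totalBytes(30000, 160));           // 4800000
    }
}
```

Either way the bag comes to a little under 5 MB, which is in line with the figure quoted above.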
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>> You could probably store the ipsegments as Tuple and
>> test
>> > >> it
>> > >> > > >> on
>> > >> > > >>>>>> your
>> > >> > > >>>>>>>>>> servers.
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>> On Tue, Dec 13, 2011 at 8:39 PM, Dmitriy Ryaboy <
>> > >> > > >>>>>> [email protected]>
>> > >> > > >>>>>>>>>> wrote:
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>>> Do you have many such bags or just one? If one, and
>> you
>> > >> > > >> want
>> > >> > > >>> to
>> > >> > > >>>>>> look
>> > >> > > >>>>>>>>> up
>> > >> > > >>>>>>>>>>> many IPs in it, might be more efficient to serialize
>> > this
>> > >> > > >>>>>> relation
>> > >> > > >>>>>>> to
>> > >> > > >>>>>>>>>> hdfs,
>> > >> > > >>>>>>>>>>> and write a lookup udf that specifies the serialized
>> > data
>> > >> > > >> set
>> > >> > > >>>> as
>> > >> > > >>>>>> a
>> > >> > > >>>>>>>>> file
>> > >> > > >>>>>>>>>> to
>> > >> > > >>>>>>>>>>> put in distributed cache. At init time, load up the
>> file
>> > >> > > >> into
>> > >> > > >>>>>>> memory,
>> > >> > > >>>>>>>>>> then
>> > >> > > >>>>>>>>>>> for every ip do the binary search in exec()
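
The load-once, look-up-many pattern described above can be sketched in plain Java. This is an illustration under stated assumptions, not the UDF itself: the class name CachedSegmentLookup is hypothetical, the input is assumed to be tab-separated lines of ipstart, ipend and name, and how the file reaches the task (for example via the distributed cache) is outside the sketch. It uses java.util.TreeMap.floorEntry in place of a hand-written binary search; both are O(log n) per lookup.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CachedSegmentLookup {
    // Upper bound and name of one segment, keyed in the map by its start.
    private static final class Range {
        final long end;
        final String name;

        Range(long end, String name) {
            this.end = end;
            this.name = name;
        }
    }

    private TreeMap<Long, Range> byStart;   // null until load() has run

    // Parses lines of the assumed form "ipstart<TAB>ipend<TAB>name".
    // Call once at init time; the map is then reused for every lookup.
    void load(List<String> lines) {
        TreeMap<Long, Range> map = new TreeMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;                   // skip blank lines
            }
            String[] f = line.split("\t");
            map.put(Long.parseLong(f[0]), new Range(Long.parseLong(f[1]), f[2]));
        }
        byStart = map;
    }

    // Returns the segment name containing ip, or null if no segment does.
    String lookup(long ip) {
        if (byStart == null) {
            throw new IllegalStateException("load() has not been called");
        }
        Map.Entry<Long, Range> e = byStart.floorEntry(ip);  // greatest start <= ip
        if (e == null || ip > e.getValue().end) {
            return null;
        }
        return e.getValue().name;
    }
}
```

In a UDF, load() would run lazily on the first call to exec(), and each later call would only invoke lookup().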
>> > >> > > >>>>>>>>>>>
>> > >> > > >>>>>>>>>>> On Dec 13, 2011, at 7:55 PM, 唐亮 <[email protected]>
>> > >> > > >> wrote:
>> > >> > > >>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> Thank you all!
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> The detail is:
>> > >> > > >>>>>>>>>>>> A bag contains many "IP Segments", whose schema is
>> > >> > > >>>>>> (ipStart:long,
>> > >> > > >>>>>>>>>>>> ipEnd:long, locName:chararray) and the number of
>> tuples
>> > >> > > >> is
>> > >> > > >>>>>> about
>> > >> > > >>>>>>>>> 30000,
>> > >> > > >>>>>>>>>>>> and I want to check whether an IP belongs to one
>> > >> > > >> segment
>> > >> > > >>>> in
>> > >> > > >>>>>> the
>> > >> > > >>>>>>>>> bag.
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> I want to order the "IP Segments" by (ipStart, ipEnd)
>> > in
>> > >> > > >>> MR,
>> > >> > > >>>>>>>>>>>> and then binary search whether an IP is in the bag
>> in
>> > >> > > >> UDF.
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> If enumerate every IP, it will be more than 100000000
>> > >> > > >>> single
>> > >> > > >>>>>> IPs,
>> > >> > > >>>>>>>>>>>> I think it will also be time consuming by JOIN in
>> PIG.
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> Please help me deal with it efficiently!
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>> 2011/12/14 Thejas Nair <[email protected]>
>> > >> > > >>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>> My assumption is that 唐亮 is trying to do binary
>> search
>> > >> > > >> on
>> > >> > > >>>> bags
>> > >> > > >>>>>>>>> within
>> > >> > > >>>>>>>>>>> the
>> > >> > > >>>>>>>>>>>>> tuples in a relation (ie schema of the relation has
>> a
>> > >> > > >> bag
>> > >> > > >>>>>>> column).
>> > >> > > >>>>>>>>> I
>> > >> > > >>>>>>>>>>> don't
>> > >> > > >>>>>>>>>>>>> think he is trying to treat the entire relation as
>> one
>> > >> > > >> bag
>> > >> > > >>>>>> and do
>> > >> > > >>>>>>>>>> binary
>> > >> > > >>>>>>>>>>>>> search on that.
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>> -Thejas
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>> On 12/13/11 2:30 PM, Andrew Wells wrote:
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> I don't think this could be done,
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> pig is just a hadoop job, and the idea behind
>> hadoop
>> > is
>> > >> > > >>> to
>> > >> > > >>>>>> read
>> > >> > > >>>>>>>>> all
>> > >> > > >>>>>>>>>> the
>> > >> > > >>>>>>>>>>>>>> data in a file.
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> so by the time you put all the data into an array,
>> > you
>> > >> > > >>>> would
>> > >> > > >>>>>>> have
>> > >> > > >>>>>>>>>> been
>> > >> > > >>>>>>>>>>>>>> better off just checking each element for the one
>> you
>> > >> > > >>> were
>> > >> > > >>>>>>> looking
>> > >> > > >>>>>>>>>> for.
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> So what you would get is [n + lg (n)], which will
>> > just
>> > >> > > >> be
>> > >> > > >>>> [n]
>> > >> > > >>>>>>>>> after
>> > >> > > >>>>>>>>>>>>>> putting
>> > >> > > >>>>>>>>>>>>>> that into an array.
>> > >> > > >>>>>>>>>>>>>> Second, hadoop is all about large data analysis,
>> > >> > > >> usually
>> > >> > > >>>> more
>> > >> > > >>>>>>> than
>> > >> > > >>>>>>>>>>> 100GB,
>> > >> > > >>>>>>>>>>>>>> so putting this into memory is out of the question.
>> > >> > > >>>>>>>>>>>>>> Third, hadoop is efficient because it processes
>> this
>> > >> > > >>> large
>> > >> > > >>>>>>> amount
>> > >> > > >>>>>>>>> of
>> > >> > > >>>>>>>>>>> data
>> > >> > > >>>>>>>>>>>>>> by splitting it up into multiple processes. To do
>> an
>> > >> > > >>>>>> efficient
>> > >> > > >>>>>>>>> binary
>> > >> > > >>>>>>>>>>>>>> search, you would need do this in one mapper or one
>> > >> > > >>>> reducer.
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> My opinion is just don't fight hadoop/pig.
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> On Tue, Dec 13, 2011 at 1:56 PM, Thejas Nair<
>> > >> > > >>>>>>>>> [email protected]>
>> > >> > > >>>>>>>>>>>>>> wrote:
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>> Bags can be very large and might not fit into memory,
>> and
>> > >> > > >> in
>> > >> > > >>>> such
>> > >> > > >>>>>>>>> cases
>> > >> > > >>>>>>>>>>> some
>> > >> > > >>>>>>>>>>>>>>> or all of the bag might have to be stored on disk.
>> > In
>> > >> > > >>> such
>> > >> > > >>>>>>>>> cases, it
>> > >> > > >>>>>>>>>>> is
>> > >> > > >>>>>>>>>>>>>>> not
>> > >> > > >>>>>>>>>>>>>>> efficient to do random access on the bag. That is
>> > why
>> > >> > > >>> the
>> > >> > > >>>>>>> DataBag
>> > >> > > >>>>>>>>>>>>>>> interface
>> > >> > > >>>>>>>>>>>>>>> does not support it.
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>> As Prashant suggested, storing it in a tuple would
>> > be
>> > >> > > >> a
>> > >> > > >>>> good
>> > >> > > >>>>>>>>>>> alternative,
>> > >> > > >>>>>>>>>>>>>>> if you want to have random access to do binary
>> > search.
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>> -Thejas
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>> On 12/12/11 7:54 PM, 唐亮 wrote:
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>> Hi all,
>> > >> > > >>>>>>>>>>>>>>>> How can I implement a binary search in pig?
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>> In one relation, there exists a bag whose items
>> are
>> > >> > > >>>> sorted.
>> > >> > > >>>>>>>>>>>>>>>> And I want to check whether there exists a specific item
>> in
>> > >> > > >> the
>> > >> > > >>>>>> bag.
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>> In UDF, I can't random access items in DataBag
>> > >> > > >>> container.
>> > >> > > >>>>>>>>>>>>>>>> So I have to transfer the items in DataBag to an
>> > >> > > >>>> ArrayList,
>> > >> > > >>>>>>> and
>> > >> > > >>>>>>>>>> this
>> > >> > > >>>>>>>>>>> is
>> > >> > > >>>>>>>>>>>>>>>> time consuming.
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>> How can I implement the binary search efficiently
>> > in
>> > >> > > >>> pig?
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>>>
>> > >> > > >>>>>>>>>>>
>> > >> > > >>>>>>>>>>
>> > >> > > >>>>>>>>>
>> > >> > > >>>>>>>>
>> > >> > > >>>>>>>>
>> > >> > > >>>>>>>
>> > >> > > >>>>>>
>> > >> > > >>>>>
>> > >> > > >>>>>
>> > >> > > >>>>
>> > >> > > >>>
>> > >> > > >>
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
